lucene-java-user mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject 182 file formats for lucene!!! was: Re: Exotic format indexing?
Date Thu, 30 Oct 2003 20:02:42 GMT
Hi there,

Just to let you know, I have implemented a plugin for the Nutch project
that can parse 182 file formats, including M$ Office.
It simply uses OpenOffice through its available Java API.

It is really straightforward to use.

You can find some info and a link to the open source code here:
http://sourceforge.net/tracker/index.php?func=detail&aid=828517&group_id=59548&atid=491356

Feel free to recycle the code and give me any feedback.
Hope it will help to free some information from some strange commercial 
formats, since information should be free. ;)

Cheers
Stefan

Ben Litchfield wrote:

>Unfortunately, it is not quite so easy.  I am not sure about Word
>documents, but PDFs usually have their contents compressed, so a raw
>"fishing" around for text would be pointless.  Your best bet is to use a
>package like the one from textmining.org that handles various formats for
>you.
>
>Ben
>
>
>On Thu, 30 Oct 2003, petite_abeille wrote:
>
>  
>
>>Hello,
>>
>>Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
>>popular question on this list...
>>
>>The traditional approach seems to be to try to find some kind of
>>format-specific reader to properly extract the textual part of such
>>documents for indexing. The drawback of such an approach is that it's
>>complicated and cumbersome: many different formats, not that many Java
>>libraries to understand them all.
>>
>>An alternative to such a mess could perhaps be to convert that
>>multitude of formats into something more or less standard and then
>>extract the text from that. But again, this doesn't seem to be such a
>>straightforward proposition. For example, one could imagine "printing"
>>every document to PDF and then converting the resulting PDF to text.
>>Not a piece of cake in Java.
>>
>>Finally, a while back, somebody on this list mentioned quite a
>>different approach: simply read the raw binary document and go fishing
>>for what looks like text. I would like to try that :)
>>
>>Does anyone remember this proposal? Has anyone tried such an approach?
>>
>>Thanks for any pointers.
>>
>>Cheers,
>>
>>PA.
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
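
For anyone who wants to try the raw "fishing" idea from the quoted thread,
here is a minimal sketch in Java. It is not code from any of the posters.
It collects runs of printable ASCII bytes, much like the Unix strings tool,
and then repeats the exercise on a deflate-compressed copy of the same text
to illustrate Ben's point about compressed PDF contents. The class name
RawTextFisher, the method names, and the minimum run length of 4 are
assumptions chosen purely for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical helper, not part of Lucene or Nutch.
class RawTextFisher {

    /** Returns every run of at least minLength printable ASCII bytes in data. */
    static List<String> fish(byte[] data, int minLength) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (byte b : data) {
            if (b >= 0x20 && b <= 0x7E) {          // printable ASCII range
                current.append((char) b);
            } else {
                if (current.length() >= minLength) {
                    runs.add(current.toString());
                }
                current.setLength(0);              // a non-text byte ends the run
            }
        }
        if (current.length() >= minLength) {       // flush a run that reaches EOF
            runs.add(current.toString());
        }
        return runs;
    }

    /** Deflate-compresses input, standing in for a compressed document stream. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Text embedded between a few binary bytes, as in an uncompressed file.
        byte[] raw = "\u0000\u0001Lucene indexes plain text\u00FF\u0000"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("uncompressed: " + fish(raw, 4));

        // The same kind of text after compression: the words are no longer
        // stored as contiguous ASCII bytes.
        byte[] packed = deflate(
                "Lucene indexes plain text. Lucene indexes plain text."
                        .getBytes(StandardCharsets.US_ASCII));
        System.out.println("compressed:   " + fish(packed, 4));
    }
}
```

On the uncompressed buffer the phrase comes back intact. On the compressed
buffer the original words are not recoverable this way, which is why a
format-aware parser is needed for such files.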




