lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Litchfield <...@csh.rit.edu>
Subject Re: Exotic format indexing?
Date Thu, 30 Oct 2003 19:48:41 GMT
Unfortunately, it is not quite so easy.  I am not sure about Word
documents but PDFs usually have there contents compressed so a raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.

Ben


On Thu, 30 Oct 2003, petite_abeille wrote:

> Hello,
>
> Indexing a multitude of esoteric formats (MS Office, PDF, etc) is a
> popular question on this list...
>
> The traditional approach seems to be to try to find some kind of format
> specific reader to properly extract the textual part of such documents
> for indexing. The drawback of such an approach is that its complicated
> and cumborsome: many different formats, not that many Java libraries to
> understand them all.
>
> An alternative to such a mess could be perhaps to convert those
> multitude of formats into something more or less standard and then
> extract the text from that. But again, this doesn't seem to be such a
> straightforward proposition. For example, one could image "printing"
> every document to PDF and then convert the resulting PDF to text. Not a
> piece of cake in Java.
>
> Finally, a while back, somebody on this list mentioned quiet a
> different approach: simply read the raw binary document and go fishing
> for what looks like text. I would like to try that :)
>
> Does anyone remember this proposal? Has anyone tried such an approach?
>
> Thanks for any pointers.
>
> Cheers,
>
> PA.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Mime
View raw message