lucene-java-user mailing list archives

From "Chong, Herb" <HCho...@bloomberg.com>
Subject RE: Exotic format indexing?
Date Thu, 30 Oct 2003 20:05:33 GMT
Word documents saved with FastSave enabled contain the original document followed by deltas
(incremental changes) to it. Once the deltas exceed a certain size, they are merged back into
the document. That means that unless you apply the deltas, you won't know what the actual
final contents are.

Herb....
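
To illustrate the point about deltas, here is a toy sketch in Python. It is not the Word binary format: the delta representation (start, end, replacement) and the function name apply_deltas are hypothetical, chosen only to show why scanning the stored base text gives stale results.

```python
from typing import List, Tuple

# Hypothetical delta: replace text[start:end] with `replacement`.
# This is a toy model, NOT the actual FastSave on-disk layout.
Delta = Tuple[int, int, str]


def apply_deltas(base: str, deltas: List[Delta]) -> str:
    """Apply each delta in order; offsets refer to the text as it stands at that step."""
    text = base
    for start, end, replacement in deltas:
        text = text[:start] + replacement + text[end:]
    return text


# What a raw scan of the file would find first: the original document.
base = "The quarterly report is confidential."

# Edits saved later as deltas instead of rewriting the whole document.
deltas = [
    (4, 13, "annual"),   # "quarterly" -> "annual"
    (21, 33, "public"),  # "confidential" -> "public" (offsets after the first edit)
]

final_text = apply_deltas(base, deltas)

# Indexing the base alone would index a word the final document no longer contains.
stale_hit = "confidential" in base
real_hit = "confidential" in final_text
```

In this model an indexer that only reads the base text would match "confidential" even though the document the user sees no longer contains that word.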

-----Original Message-----
From: Ben Litchfield [mailto:ben@csh.rit.edu]
Sent: Thursday, October 30, 2003 2:49 PM
To: Lucene Users List
Subject: Re: Exotic format indexing?


Unfortunately, it is not quite so easy.  I am not sure about Word
documents, but PDFs usually have their contents compressed, so a raw
"fishing" around for text would be pointless.  Your best bet is to use a
package like the one from textmining.org that handles various formats for
you.

Ben
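
A minimal Python sketch of why "fishing" fails on compressed content. It assumes the stream is deflate-compressed (PDF's FlateDecode filter uses the same zlib/deflate scheme); the sample content string and the helper name naive_contains are made up for illustration, and a real PDF also needs object parsing and font-encoding handling that a dedicated package provides.

```python
import zlib

# Hypothetical page content: PDF text operators wrapping the visible text.
content = b"BT /F1 12 Tf 72 720 Td (Lucene indexes plain text) Tj ET"

# Assumption: the stream is stored deflate-compressed, as FlateDecode streams are.
compressed = zlib.compress(content)


def naive_contains(raw: bytes, needle: bytes) -> bool:
    """Raw byte search, i.e. 'fishing' for text without decoding anything."""
    return needle in raw


# Searching the bytes as stored on disk misses the text.
found_raw = naive_contains(compressed, b"Lucene")

# Searching after decompression finds it.
found_decoded = naive_contains(zlib.decompress(compressed), b"Lucene")
```

Decompressing is only the first step; the decoded stream still has to be interpreted before it yields clean text for indexing.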

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

