lucene-java-user mailing list archives

From "Dmitry Goldenberg"
Subject RE: Lucene - FileFormat
Date Fri, 21 Apr 2006 16:08:52 GMT
I wonder if using Zoe might do the trick -
Have you tried it?
- Dmitry


From: Fisheye
Sent: Fri 4/21/2006 7:23 AM
Subject: Lucene - FileFormat

I'm trying to build a plain-text parser for different file formats such as
MS Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc.

I use the well-known libraries PDFBox and POI, plus some parts of AtLeap, and
now I need to support the OpenOffice formats and, more importantly, the MSG
format (MS Outlook message format).

Does anyone know how I can extract plain text from MSG files simply, the way
POI does for MS Office? Perhaps there is an open source library for it, like
those for PDF or MS Office.

I need the plain text because the only workable approach for me seems to be
extracting all the plain text from every single document and then adding it
to my Lucene index. This is necessary to get the best excerpt from the
highlighter.
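
The "one plain-text extractor per format" setup described above could be
sketched roughly as follows. This is only an illustration using the Java
standard library: the class and method names (PlainTextExtractors, Extractor,
withDefaults) are hypothetical, the HTML handling is a crude regex tag
stripper, UTF-8 input is assumed, and the PDFBox, POI, and MSG extractors are
not shown. They would be registered the same way, each wrapping the
respective library.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical dispatcher that picks a plain-text extractor by file extension. */
public class PlainTextExtractors {

    /** Turns the raw bytes of one document into plain text. */
    public interface Extractor {
        String extract(byte[] content);
    }

    private final Map<String, Extractor> byExtension = new HashMap<>();

    /** Registers an extractor for an extension such as "pdf" or "msg". */
    public void register(String extension, Extractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Looks up the extractor for the file's extension and runs it. */
    public String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Extractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }

    /** Builds a dispatcher with two illustrative extractors (assumes UTF-8 input). */
    public static PlainTextExtractors withDefaults() {
        PlainTextExtractors extractors = new PlainTextExtractors();
        // Plain text needs no parsing beyond decoding the bytes.
        extractors.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Crude HTML handling for illustration only: drop tags, collapse whitespace.
        extractors.register("html", bytes ->
                new String(bytes, StandardCharsets.UTF_8)
                        .replaceAll("<[^>]*>", " ")
                        .replaceAll("\\s+", " ")
                        .trim());
        return extractors;
    }
}
```

The string returned by extract(...) is what would then be added to the Lucene
index as the document's text field, so that the highlighter has the full
plain text to pick an excerpt from.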


Simon Dietschi