lucene-java-user mailing list archives

From Fisheye <des...@gmx.ch>
Subject Lucene - FileFormat
Date Fri, 21 Apr 2006 11:23:51 GMT

I'm trying to build a plain-text parser for different file formats such as MS
Word, Excel, PowerPoint, Rich Text Format, plain text, HTML, PDF, etc.

I use the well-known libraries PDFBox and POI, plus some parts of AtLeap, and
now I need to support the OpenOffice formats and, more importantly, the MSG
format (MS Outlook message format).

Does anyone know how I can extract plain text from MSG files as simply as with
POI? Is there perhaps an open source library for this, like the ones for PDF
or MS Office files?

I need the plain text because the only approach that seems to work for me is
to extract all the plain text from every single document and then add it to my
Lucene index. This is necessary to get the best excerpt from the highlighter.
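
To make the setup concrete, here is a rough sketch of the "one extractor per
format" idea, using only the Java standard library. The class and method names
(PlainTextExtractors, register, extract, withDefaults) are made up for
illustration and are not part of Lucene, PDFBox or POI. In a real setup the
registered functions would wrap those libraries, and the returned string is
what would be added to the Lucene index. An unsupported format such as msg
simply yields null here.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a file extension to a function that turns
// the raw bytes of a document into plain text.
class PlainTextExtractors {
    private final Map<String, Function<byte[], String>> byExtension = new HashMap<>();

    // Register an extractor for one extension (case-insensitive).
    void register(String extension, Function<byte[], String> extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Returns the extracted text, or null when the file has no extension
    // or no extractor is registered for it.
    String extract(String fileName, byte[] content) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return null;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        Function<byte[], String> extractor = byExtension.get(ext);
        return extractor == null ? null : extractor.apply(content);
    }

    // Two stand-in extractors; real ones would delegate to PDFBox, POI, etc.
    static PlainTextExtractors withDefaults() {
        PlainTextExtractors registry = new PlainTextExtractors();
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Crude tag stripping for illustration only; not a real HTML parser.
        registry.register("html", bytes -> new String(bytes, StandardCharsets.UTF_8)
                .replaceAll("<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim());
        return registry;
    }
}
```

Supporting a new format then means registering one more function, so an msg
extractor could be plugged in the same way once a suitable library is found.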

Thx

Simon Dietschi