lucene-java-user mailing list archives

From "Lukas Vlcek" <>
Subject Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Mon, 12 May 2008 14:03:24 GMT

I need a reliable way to extract content from Word, Excel and PowerPoint
files prior to indexing, and I am not sure whether POI is the best way to
go. Can anybody share experience with POI and/or any other [commercial]
Java library for text extraction from MS formats?

In my experience with POI, it can sometimes be a pain to get the content
out of MS files properly. I also know that the Nutch plugin uses POI for MS
formats, but as far as I remember it is not 100% reliable (in my experience,
now more than a year old, about 1-2% of files could not be parsed).

My requirements are that the text extraction software must run on Linux and
should be written in Java. It can be an open source or a commercial library.


