lucene-java-user mailing list archives

From "Lukas Vlcek" <lukas.vl...@gmail.com>
Subject Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Mon, 12 May 2008 14:03:24 GMT
Hi,

I need a reliable way to extract content from Word, Excel and
PowerPoint files prior to indexing, and I am not sure whether POI is the best
way to go. Can anybody share experience with POI and/or any other [commercial]
Java library for text extraction from MS formats?

In my experience with POI, it can sometimes be a pain to get the
content out of MS files properly. I also know that the Nutch plugin uses POI
for MS formats, but as far as I remember it is not 100% reliable (in my
experience from more than a year ago, about 1-2% of files could not be parsed).
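
A failure rate like this can be measured on a sample of real documents
before committing to a library. The sketch below is only an illustration:
`ExtractionCheck` and the `TextExtractor` interface are hypothetical names,
not part of POI or Nutch, and the actual library calls would go inside an
implementation of `extract`. It counts a file as failed if extraction throws
or yields no text, which is an assumption about what "not parsed" means.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class ExtractionCheck {

    /** Hypothetical adapter: wrap POI or any other library behind this. */
    interface TextExtractor {
        String extract(Path file) throws Exception;
    }

    /** Returns the files for which extraction threw or produced no text. */
    static List<Path> findFailures(List<Path> files, TextExtractor extractor) {
        List<Path> failed = new ArrayList<>();
        for (Path file : files) {
            try {
                String text = extractor.extract(file);
                // Assumption: empty or whitespace-only output counts as a failure.
                if (text == null || text.trim().isEmpty()) {
                    failed.add(file);
                }
            } catch (Exception e) {
                // Any exception from the extractor marks the file as failed.
                failed.add(file);
            }
        }
        return failed;
    }

    /** Fraction of files that failed, in the range 0.0 to 1.0. */
    static double failureRate(List<Path> files, TextExtractor extractor) {
        if (files.isEmpty()) {
            return 0.0;
        }
        return (double) findFailures(files, extractor).size() / files.size();
    }
}
```

Running the same file list through two adapters (for example, one for POI
and one for a commercial library) would give comparable numbers, and the
list returned by `findFailures` shows which documents to inspect by hand.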

My requirements are that the text extraction software must run on Linux and
should be written in Java; it can be an open source or a commercial library.

Regards,
Lukas

-- 
http://blog.lukas-vlcek.com/
