lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 13:41:55 GMT
I've used POI, as well as commercial providers.  As always, it  
depends :-)  I wasn't particularly impressed with the commercial  
providers given the amount of money they wanted for it.   PDF was  
particularly tricky, but you weren't asking about that.   At least w/  
POI, you have the opportunity to fix things that don't work based on  
your priorities.  I don't know what the failure rate is for the  
commercial providers, but my experience is they will all fail at least  
once, so you better plan on it.  I'd look to use a framework like Tika  
or Aperture, where you can easily upgrade or plug in new or different  
libraries (including commercial providers) as needed w/o rewriting  
your code.  Additionally, with something like Tika or Aperture, you  
could easily mix and match your solutions, such that you use one for  
Word and a different one for PPT or PDF.

One issue with any of them is how you plan to use them.  If you need  
more than bag of words, they all get less reliable, especially when it  
comes to PDFs and Office docs.  Dealing with things like tables,  
columns, captions, labels, etc. has always been problematic in my  
experience when one wants to do higher level processing (beyond  
keyword search).


On May 12, 2008, at 10:03 AM, Lukas Vlcek wrote:

> Hi,
> I need to find a reliable way how to extract content out of Word,  
> Excel and
> PowerPoint formats prior to indexing and I am not sure if POI is the  
> best
> way to go. Can anybody share experience with POI and/or other  
> [commercial]
> Java library for text extraction from MS formats?
> My experience with POI is such that sometimes it can be a pain to  
> get the
> content out of the MS files properly. I also know that Nutch plugin  
> uses POI
> for MS formats but as far as I remember it is not 100% reliable (my  
> more
> then one year old experience is that about 1-2% of files were not  
> parsed).
> My requirements are that the text extraction software must run on  
> Linux and
> should be written in Java, it can be open source or commercial  
> library.
> Regards,
> Lukas
> -- 

Grant Ingersoll

Lucene Helpful Hints:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message