lucene-java-user mailing list archives

From Andrzej Bialecki
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Tue, 13 May 2008 14:13:15 GMT
Grant Ingersoll wrote:
> I've used POI, as well as commercial providers.  As always, it depends 
> :-)  I wasn't particularly impressed with the commercial providers given 
> the amount of money they wanted for it.   PDF was particularly tricky, 
> but you weren't asking about that.   At least w/ POI, you have the 
> opportunity to fix things that don't work based on your priorities.  I 
> don't know what the failure rate is for the commercial providers, but my 
> experience is they will all fail at least once, so you better plan on 
> it.  I'd look to use a framework like Tika or Aperture, where you can 
> easily upgrade or plug in new or different libraries (including 
> commercial providers) as needed w/o rewriting your code.  Additionally, 
> with something like Tika or Aperture, you could easily mix and match 
> your solutions, such that you use one for Word and a different one for 
> PPT or PDF.
> One issue with any of them is how you plan to use them.  If you need 
> more than bag of words, they all get less reliable, especially when it 
> comes to PDFs and Office docs.  Dealing with things like tables, 
> columns, captions, labels, etc. has always been problematic in my 
> experience when one wants to do higher level processing (beyond keyword 
> search).

Yet another option: in the past I used a licensed copy of MS Office
to extract the things that I wanted, using a bit of OLE automation and
VBScript. It worked reasonably well, in the sense that I had no issues
whatsoever with extracting the content _and_ formatting from any
document that could normally be opened with MS Office. However,
performance was an issue, i.e. it was slow, a CPU/memory hog, and
occasionally it would get stuck in a weird state where only a complete
reboot would help.
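
To make the two points in this thread concrete (a different extractor per
format, and planning for an extractor that fails or hangs), here is a
minimal Java sketch. Everything in it is an assumption for illustration:
ExtractorRegistry and TextExtractor are hypothetical names, not the Tika,
Aperture or POI API, and a real extractor would wrap one of those libraries
or an Office automation bridge.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ExtractorRegistry {
    /** Hypothetical plug-in point: one implementation per backend library. */
    public interface TextExtractor {
        String extract(byte[] data) throws Exception;
    }

    // Extractors per format, tried in registration order.
    private final Map<String, List<TextExtractor>> byFormat = new HashMap<>();
    private final long timeoutMillis;

    public ExtractorRegistry(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void register(String format, TextExtractor extractor) {
        byFormat.computeIfAbsent(format, k -> new ArrayList<>()).add(extractor);
    }

    /** Returns the first successful result, or empty if every extractor fails. */
    public Optional<String> extract(String format, byte[] data) {
        for (TextExtractor extractor : byFormat.getOrDefault(format, List.of())) {
            // Daemon thread, so a stuck extractor cannot keep the JVM alive.
            ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r);
                t.setDaemon(true);
                return t;
            });
            Callable<String> task = () -> extractor.extract(data);
            Future<String> result = pool.submit(task);
            try {
                return Optional.of(result.get(timeoutMillis, TimeUnit.MILLISECONDS));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Optional.empty();
            } catch (ExecutionException | TimeoutException e) {
                // Failed or too slow: give up on this one, try the next.
                result.cancel(true);
            } finally {
                pool.shutdownNow();
            }
        }
        return Optional.empty();
    }
}
```

The timeout only abandons the Java thread. An external process that is
truly wedged, as described above for MS Office, would still need to be
killed or restarted outside the JVM.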

Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
