lucene-java-user mailing list archives

From Nick Burch
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Mon, 12 May 2008 15:58:21 GMT
On Mon, 12 May 2008, Lukas Vlcek wrote:
> I need to find a reliable way how to extract content out of Word, Excel 
> and PowerPoint formats prior to indexing and I am not sure if POI is the 
> best way to go. Can anybody share experience with POI and/or other 
> [commercial] Java library for text extraction from MS formats?

We use POI for text extraction, and it works just fine for us. POI 3.1 
should offer a few improvements to text extraction, and POI 3.5 will give 
you OOXML text extraction too.

You might also like to take a look at Apache Tika. It wraps POI (and a few 
other document extractor libraries), giving you a simple, common interface 
for text extraction.
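
To illustrate what a "simple, common interface" over several extractors can 
look like, here is a minimal sketch in plain Java. It is not Tika's or POI's 
actual API: the names ExtractorFacade, TextExtractor, register and extract 
are all hypothetical, and only a trivial plain-text extractor is wired in. 
A POI-backed extractor for each Office format would be registered the same 
way.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ExtractorFacade {
    /** Hypothetical common interface: one method, whatever the format. */
    public interface TextExtractor {
        String extract(byte[] content) throws Exception;
    }

    // Maps a lower-cased file extension to the extractor that handles it.
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    public void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    /** Picks an extractor from the file name's extension and delegates to it. */
    public String extract(String fileName, byte[] content) throws Exception {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }

    public static void main(String[] args) throws Exception {
        ExtractorFacade facade = new ExtractorFacade();
        // Stand-in extractor for plain text; assumption: input is UTF-8.
        facade.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));
        // Library-backed extractors for "doc", "xls", "ppt" would be registered here.
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(facade.extract("notes.TXT", sample));
    }
}
```

The indexing code then only ever calls extract(fileName, content), so 
supporting a new format means registering one more extractor rather than 
changing the indexer.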

