lucene-java-user mailing list archives

From Nick Burch <n...@torchbox.com>
Subject Re: Can POI provide reliable text extraction results for production search engine for Word, Excel and PowerPoint formats?
Date Mon, 12 May 2008 15:58:21 GMT
On Mon, 12 May 2008, Lukas Vlcek wrote:
> I need to find a reliable way how to extract content out of Word, Excel 
> and PowerPoint formats prior to indexing and I am not sure if POI is the 
> best way to go. Can anybody share experience with POI and/or other 
> [commercial] Java library for text extraction from MS formats?

We use POI for text extraction, and it works just fine for us. POI 3.1 
should offer a few improvements to text extraction, and POI 3.5 will give 
you OOXML text extraction too.

You might also like to take a look at Apache Tika 
<http://incubator.apache.org/tika/>. It wraps POI (and a few other 
document extractor libraries), giving you a simple, common interface for 
text extraction.
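
To illustrate what a "common interface" over several extractors buys you, 
here is a rough sketch in plain Java. It is not Tika's or POI's actual API: 
the names TextExtractor, ExtractorRegistry and ExtractDemo are hypothetical, 
and the plain-text extractor is only a stand-in for where a real back end 
(for example a POI-based one) would be registered.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical facade: one interface, many format-specific back ends.
interface TextExtractor {
    String extract(byte[] content) throws Exception;
}

// Hypothetical registry that picks an extractor by file extension.
class ExtractorRegistry {
    private final Map<String, TextExtractor> byExtension = new HashMap<>();

    void register(String extension, TextExtractor extractor) {
        byExtension.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    String extractText(String fileName, byte[] content) throws Exception {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        TextExtractor extractor = byExtension.get(ext);
        if (extractor == null) {
            throw new IllegalArgumentException("No extractor registered for: " + fileName);
        }
        return extractor.extract(content);
    }
}

public class ExtractDemo {
    public static void main(String[] args) throws Exception {
        ExtractorRegistry registry = new ExtractorRegistry();
        // Stand-in back end: decodes plain text. A real setup would register
        // format-specific extractors for "doc", "xls", "ppt" here instead.
        registry.register("txt", bytes -> new String(bytes, StandardCharsets.UTF_8));

        byte[] sample = "hello from a text file".getBytes(StandardCharsets.UTF_8);
        // The indexing code only ever calls extractText, whatever the format.
        System.out.println(registry.extractText("notes.TXT", sample));
    }
}
```

The point of the design is that the indexing code depends only on 
extractText, so adding or swapping a format-specific library does not 
touch it.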

Nick
