lucene-java-user mailing list archives

From Chas Emerick <cemer...@snowtide.com>
Subject Re: Existing Parsers
Date Thu, 09 Sep 2004 14:04:47 GMT
There are a number of libraries for Java that provide PDF text
extraction functionality.  A pretty comprehensive list is available at
<http://www.geocities.com/marcoschmidt.geo/java-libraries-pdf.html>.
I'm obviously biased towards recommending our solution, PDFTextStream
<http://snowtide.com/home/PDFTextStream/>; it's the fastest thing out
there for Java, and it provides a very easy-to-use Lucene integration
module that will have you up and running in no time
<http://snowtide.com/home/PDFTextStream/techtips/easy_lucene_integration>.

For office documents, just about the only game in town that I know of
is the Jakarta POI project <http://jakarta.apache.org/poi/>.  It's
been quite a while since I've touched it, but it's definitely the best
place to start.

Chas Emerick   |   cemerick@snowtide.com

PDFTextStream: fast PDF text extraction for Java apps and Lucene
http://snowtide.com/home/PDFTextStream/

On Sep 9, 2004, at 9:47 AM, <dhatcher@webtads.com> wrote:

> Anyone know of any reliable parsers out there for PDF, Word,
> Excel, or PowerPoint?



