lucene-java-user mailing list archives

From Chas Emerick <>
Subject Re: Existing Parsers
Date Thu, 09 Sep 2004 14:04:47 GMT
There are a number of libraries for Java that provide PDF text
extraction functionality.  A pretty comprehensive list is available at
< >.
I'm obviously biased towards recommending our solution, PDFTextStream
< >; it's the fastest thing out there for Java, and it provides a very
easy-to-use Lucene integration module that will have you up and
running in no time < >.

For office documents, just about the only game in town that I know of 
is the Jakarta POI project < >.  It's 
been quite a while since I've touched it, but it's definitely the best 
place to start.

Chas Emerick

PDFTextStream: fast PDF text extraction for Java apps and Lucene

On Sep 9, 2004, at 9:47 AM, <> wrote:

> Anyone know of any reliable parsers out there for PDF, Word,
> Excel, or PowerPoint?
