lucene-java-user mailing list archives

From Maxim Patramanskij <>
Subject Re: PDF->Text Performance comparison
Date Thu, 09 Sep 2004 08:04:58 GMT
Hello Ben,

I've been using PDFBox for the past year, but only version 0.6.3,
for two reasons:

 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but every time I had
 problems parsing the same PDF documents that worked well with
 0.6.3. I mentioned my problems here:

 2) When I started with 0.6.3 I experienced performance problems
 too, especially with large PDF documents (I had several that were
 more than 20MB in size). I changed the source a bit, wrapping the
 following line of the BaseParser class:

            out = stream.createFilteredStream( streamLength );

 in a BufferedOutputStream, so that it reads:

            out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));

 The performance increase I got was huge:
 parsing a 21MB PDF document to text took 78 seconds before the
 modification and 12 seconds after it, so more than 6 times faster.

 I also tried using buffered streams in some other places, but the
 effect was not as visible. I hope this change can also be incorporated into
 the current 0.6.6 release, and then the benchmarks may come out better for PDFBox.
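
 To illustrate why such a wrapper can help, here is a minimal sketch. It
 does not use PDFBox at all: CountingSink and writeByteByByte are
 hypothetical names made up for the example, and it assumes the parser
 pushes data to the stream in many small writes, which the message does
 not confirm. It only counts how many calls reach the underlying stream
 with and without a BufferedOutputStream in front of it.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class BufferingSketch {

    /** Hypothetical sink (not a PDFBox class) that only counts how it is called. */
    static final class CountingSink extends OutputStream {
        long calls = 0;  // number of write calls that reached this stream
        long bytes = 0;  // total number of bytes received

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /** Writes totalBytes one byte at a time and returns the sink that received them. */
    static CountingSink writeByteByByte(int totalBytes, boolean buffered) throws IOException {
        CountingSink sink = new CountingSink();
        // Same kind of one-line change as in the message: wrap the target stream.
        OutputStream out = buffered ? new BufferedOutputStream(sink) : sink;
        for (int i = 0; i < totalBytes; i++) {
            out.write(i & 0xFF);
        }
        out.flush(); // push any bytes still held in the buffer down to the sink
        return sink;
    }

    public static void main(String[] args) throws IOException {
        int n = 1_000_000;
        CountingSink plain = writeByteByByte(n, false);
        CountingSink wrapped = writeByteByByte(n, true);
        System.out.println("unbuffered calls to sink: " + plain.calls);
        System.out.println("buffered calls to sink:   " + wrapped.calls);
    }
}
```

 Both runs deliver the same number of bytes; the buffered one reaches the
 sink far less often. How much time that saves depends on how expensive
 each call to the real underlying stream is, so the 78 to 12 second figure
 above is not something this sketch can reproduce.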


BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications

BL> For those that have not seen, has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
BL> done.


BL> PDFBox: slow PDF text extraction for Java applications

BL> :)

BL> Ben


Best regards,
