lucene-java-user mailing list archives

From Maxim Patramanskij <...@osua.de>
Subject Re: PDF->Text Performance comparison
Date Thu, 09 Sep 2004 08:04:58 GMT
Hello Ben,

I've been using PDFBox for the past year, but only version 0.6.3,
for two reasons:

 1) I tried to migrate to newer versions (0.6.4, 0.6.5, 0.6.6), but each time I had
 problems parsing the same PDF documents that worked well with
 0.6.3. I described my problems here:
  https://sourceforge.net/tracker/?func=detail&atid=552832&aid=1021691&group_id=78314

 2) When I started with 0.6.3 I experienced performance problems
 too, especially with large PDF documents (I had several larger
 than 20MB). I changed the source slightly, wrapping the following line
 of the BaseParser class:

            out = stream.createFilteredStream( streamLength );

            to
            
            out = new BufferedOutputStream(stream.createFilteredStream( streamLength ));
            

 The performance increase I got was huge:
 parsing a 21MB PDF document to text took 78 seconds before the
 modification and 12 seconds after it, so more than 6 times faster.

 I also tried using buffered streams in some other places, but the
 effect was not as noticeable. I hope this change can also be incorporated into
 the current 0.6.6 release, and then the benchmarks may come out in PDFBox's favor
 :)
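
 One plausible reason the wrapper helps so much is that the parser writes
 to the filtered stream in very small pieces, so every write reaches the
 underlying stream separately. That is an assumption, not something
 verified against the PDFBox source. The sketch below is not PDFBox code:
 CountingSink and writeByteByByte are hypothetical names used only to show
 how BufferedOutputStream collapses many one-byte writes into a few large
 ones.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteSketch {

    /** Hypothetical sink that counts how many write calls actually reach it. */
    static final class CountingSink extends OutputStream {
        long calls = 0;
        long bytes = 0;

        @Override
        public void write(int b) {
            calls++;
            bytes++;
        }

        @Override
        public void write(byte[] buf, int off, int len) {
            calls++;
            bytes += len;
        }
    }

    /** Writes n bytes one at a time (assumed access pattern, for illustration). */
    static void writeByteByByte(OutputStream out, int n) throws IOException {
        for (int i = 0; i < n; i++) {
            out.write(i & 0xFF);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        int n = 1_000_000;

        // Unbuffered: every single-byte write hits the sink directly.
        CountingSink direct = new CountingSink();
        writeByteByByte(direct, n);

        // Buffered: bytes are collected and handed to the sink in blocks.
        CountingSink behindBuffer = new CountingSink();
        try (OutputStream out = new BufferedOutputStream(behindBuffer)) {
            writeByteByByte(out, n);
        }

        System.out.println("unbuffered calls: " + direct.calls);
        System.out.println("buffered calls:   " + behindBuffer.calls);
    }
}
```

 Both sinks receive the same number of bytes; only the number of calls
 differs. How much time this saves in practice depends on how expensive
 each call on the underlying stream is.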


 Max


BL> On Wed, 8 Sep 2004, Chas Emerick wrote:
>> PDFTextStream: fast PDF text extraction for Java applications
>> http://snowtide.com/home/PDFTextStream/


BL> For those that have not seen, snowtide.com has done a performance
BL> comparison against several Java PDF->Text libraries, including Snowtide's
BL> PDFTextStream, PDFBox, Etymon PJ and JPedal.  It appears to be fairly well
BL> done.

BL> http://snowtide.com/home/PDFTextStream/Performance


BL> PDFBox: slow PDF text extraction for Java applications
BL> http://www.pdfbox.org

BL> :)

BL> Ben


BL> ---------------------------------------------------------------------
BL> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
BL> For additional commands, e-mail: lucene-user-help@jakarta.apache.org




-- 
Best regards,
 Maxim                            mailto:max@osua.de



