pdfbox-users mailing list archives

From "Zhang, Lisheng" <Lisheng.Zh...@BroadVision.com>
Subject getText() performance in PDFBox 1.5 release
Date Fri, 04 Nov 2011 16:23:49 GMT
Hi,
 
I have been using PDFBox to extract text from PDF files for full-text search for a few years,
and have found it to be a great product. Recently I downloaded PDFBox 1.5 and found that it can
extract text from many PDF files that could not be processed previously, thanks!!
 
The problem I have is that PDFTextStripper.getText(..) takes a long time to finish. For example,
our client has a 27MB PDF file that contains some graphics; getText(..) took 50m to finish
even though it only extracted about 100K of text in the end.
 
I tried changing the input parameters, but the results were essentially the same. I would like
to know whether this speed is expected and whether it can be improved.
 
Thanks very much for your help, Lisheng