lucene-java-user mailing list archives

From Christiaan Fluit
Subject Re: Use an executable from java ...
Date Tue, 08 Feb 2005 08:41:40 GMT
Kristian Hermsdorf wrote:

> We're using pdftotext as well, because PDFBox is really slow. If your 
> application has to work under Windows, you will probably experience some 
> mysterious Java VM crashes while executing external processes in batch mode. 
> (This is because of a bug in the Windows VM... we implemented our own 
> Process with JNI to work around this bug.)

Just to defend PDFBox: we actually recently decided to move in the 
opposite direction.

We just removed pdftotext from our application and are now using PDFBox 
0.7.0 for all our PDF processing. Previously we used both tools side by 
side: pdftotext for fast text extraction and PDFBox for all metadata 
such as titles, authors, etc.

One reason for this is that with version 0.7.0 the difference in 
performance was only marginal on our test set of 113 PDF documents from 
various sources. Of course the difference will be bigger when you are 
only extracting text, because in our old setup two tools had to process 
the same file.

Upon closer inspection of the output, we also saw that pdftotext was not 
able to extract text from a significant number of PDFs (9 out of 113 
documents, all of them perfectly readable) while PDFBox performed 
flawlessly. For us, quality is of greater concern than speed.

Finally, I must say that the speed and quality of Ben's replies to bug 
reports and suggestions are very impressive, giving us confidence that 
future problems will be handled satisfactorily.


