lucene-java-user mailing list archives

From Kristian Hermsdorf <kristian.hermsd...@ifbus.de>
Subject Re: Use an executable from java ...
Date Tue, 08 Feb 2005 09:22:22 GMT
Hi Christiaan

> Just to defend PDFBox: we actually recently decided to move in the
> opposite direction.

I didn't want to offend PDFBox *g*

> We just removed pdftotext from our application and are now using PDFBox
> 0.7.0 for all our PDF processing. Before we were using them both in
> parallel: pdftotext for fast text extraction and PDFBox for all metadata
> such as titles, authors, etc.

pdftotext can produce HTML output that contains this metadata as well.
In our tests, converting PDF to HTML and then parsing the HTML is still twice as fast as PDFBox.
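
For illustration, calling pdftotext from Java might look roughly like the sketch below. It assumes the `pdftotext` binary is on the PATH and that the `-htmlmeta` option is what produces the HTML with metadata (the message above does not name the flag). The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper; a sketch, not the code used in our application. */
final class PdfToTextRunner {

    /** Builds the command line: pdftotext -htmlmeta <pdf> <htmlOut>. */
    static List<String> command(Path pdf, Path htmlOut) {
        // Assumption: -htmlmeta wraps the text in simple HTML and adds meta headers.
        return Arrays.asList("pdftotext", "-htmlmeta", pdf.toString(), htmlOut.toString());
    }

    /** Runs pdftotext and returns its exit code (0 is assumed to mean success). */
    static int run(Path pdf, Path htmlOut) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdf, htmlOut))
                .inheritIO() // let the child write to our stdout/stderr so it cannot block on a full pipe
                .start();
        return process.waitFor();
    }
}
```

The resulting HTML file can then be parsed for the title, author, and body text; a non-zero exit code would be the signal to try another converter.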

> Upon closer inspection of the output, we also saw that pdftotext was not
> able to extract text from a significant amount of PDFs (9 out of 113
> documents, all perfectly readable PDF documents) while PDFBox performed
> flawlessly. For us, quality is of greater concern than speed.

That's curious, because in our experience pdftotext was able to convert 33% more PDF documents
than PDFBox.

> Finally, I must say that the speed and quality of Ben's replies to bug
> reports and suggestions is very impressive, giving us confidence in that
> future problems will be handled satisfactorily.

That's good. Our application supports alternative conversion pipelines that provide fallback
mechanisms: if the first converter cannot convert a document, a second converter is called.
So PDFBox is our fallback converter.
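
A minimal sketch of such a fallback pipeline is shown below. The `PdfConverter` interface and `FallbackPipeline` class are hypothetical names for illustration, not part of PDFBox or xpdf, and treating empty output as a failed conversion is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical converter abstraction (e.g. one wrapping pdftotext, one wrapping PDFBox). */
interface PdfConverter {
    /** Returns the extracted text, or throws if the document cannot be converted. */
    String convert(byte[] pdf) throws Exception;
}

/** Tries each converter in order and returns the first usable result. */
final class FallbackPipeline {
    private final List<PdfConverter> converters;

    FallbackPipeline(List<PdfConverter> converters) {
        this.converters = new ArrayList<>(converters); // defensive copy
    }

    Optional<String> convert(byte[] pdf) {
        for (PdfConverter converter : converters) {
            try {
                String text = converter.convert(pdf);
                // Assumption: empty output counts as a failed conversion.
                if (text != null && !text.trim().isEmpty()) {
                    return Optional.of(text);
                }
            } catch (Exception e) {
                // This converter failed; fall through to the next one.
            }
        }
        return Optional.empty(); // no converter could handle the document
    }
}
```

With the pdftotext-based converter listed first and the PDFBox-based one second, the slower converter runs only for documents the faster one could not handle.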

Greetings
Kristian

-- 
ACRONYM: Acronym Causing Recursion, Obviously Numbing Your Mind  

Kristian Hermsdorf

Interface Projects GmbH
Tolkewitzer Straße  49		
01277 Dresden			


tel.: ++49-351-3 18 09 39

mail: Kristian.Hermsdorf@interface-business.de
priv: kristian@entropus.de



