pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: fwd: A Benchmark and Evaluation for Text Extraction from PDF
Date Wed, 19 Jul 2017 17:01:06 GMT
I have received the two ERR files and one NL- file from Claudius 
Korzen. I have uploaded them at
http://www.filedropper.com/pdfboxerror1
http://www.filedropper.com/pdfboxerror2
http://www.filedropper.com/examplenlminus

About the ERR files: these are indeed bad. The content streams are bad, 
probably a bug in the creator software.

About the NL- file: he wrote that per their test design decision, the 
two formulas in "Lemma 3.3" on page 13 should be different paragraphs 
because they have different semantic roles than the body text.

I'm neutral about this... IMHO extracting formulas from a PDF is useless 
because one will never get an exact copy, due to their two-dimensionality 
(subscripts and superscripts).
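
To illustrate the two-dimensionality point, here is a minimal sketch in plain Java. It is not PDFBox code: the Glyph class, its fields, and the flatten method are hypothetical names made up for this example, and the coordinates are invented. It only shows that once glyphs are read left to right and their vertical offset is dropped, a superscript and a subscript come out as the same text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FormulaFlattenDemo {

    // Hypothetical glyph record (an assumption, not a PDFBox type):
    // the text, its horizontal position, and its vertical shift from the
    // baseline (positive = superscript, negative = subscript).
    static final class Glyph {
        final String text;
        final double x;
        final double rise;

        Glyph(String text, double x, double rise) {
            this.text = text;
            this.x = x;
            this.rise = rise;
        }
    }

    // Naive linear extraction: order glyphs left to right and ignore the rise.
    static String flatten(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        for (Glyph g : sorted) {
            sb.append(g.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "x squared": the 2 is raised above the baseline.
        List<Glyph> squared = Arrays.asList(new Glyph("x", 0, 0), new Glyph("2", 5, 4));
        // "x sub 2": the 2 is lowered below the baseline.
        List<Glyph> indexed = Arrays.asList(new Glyph("x", 0, 0), new Glyph("2", 5, -3));

        // Both formulas flatten to the same string, so the distinction is lost.
        System.out.println(flatten(squared));
        System.out.println(flatten(indexed));
    }
}
```

Both calls produce the same string, so the extracted text alone cannot tell the two formulas apart.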

Tilman



