pdfbox-dev mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: Comparing extracted text with pdftotext
Date Thu, 29 Nov 2018 19:05:54 GMT
commoncrawl3/IK/IKIMQ2USV4HEF2NF4K7UNMZZFADCKVWP
The part missing from the PDFBox output is diagonal text.

commoncrawl3/2L/2LBSIRE27J5TTKH53KR6PEZH6QKJ3BZ7
The missing words are the ones split by a "-" at the end of the line.
Interesting feature.

Tilman

On 26.11.2018 at 21:49, Tim Allison wrote:
> All,
>
>    I just finished drafting a high-level "lab report" comparing
> pdftotext and Tika/PDFBox on the PDFs in our refreshed regression
> corpus: https://wiki.apache.org/tika/ComparisonTikaAndPDFToText201811.
> The more interesting bits are in the actual reports from tika-eval
> and/or the comparison database available here:
> http://162.242.228.174/pdf_parsing/pdftotextVPDFBox_201811/
>
>    Let me know what you think.
>
>            Cheers,
>
>                     Tim
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
>

