pdfbox-users mailing list archives

From Tilman Hausherr <THaush...@t-online.de>
Subject Re: PDFbox appears to struggle with text extraction for some fonts.
Date Wed, 16 Nov 2016 18:09:06 GMT
Am 16.11.2016 um 18:47 schrieb John Logan:
> Hi,
>
> I've been using PDFbox to extract text features for layout analysis, and I'm running
> into a file that seems to render properly, but the extracted text looks totally botched.  If
> I copy/paste from Acrobat Reader or Mac Preview, the same glyphs are broken.

Yes.

Open the file in PDFDebugger and have a look here:
Root/Pages/Kids/[0]/Resources/Font/Ty7

Then scroll down and look at the "unicode" column. It is empty.

You have to understand the difference between "glyph" and "character". A
glyph is just a painting of a character. Seeing a "9" on the page does
not mean that you get a "9" in text extraction too: the mapping from the
code in the content stream to Unicode must be defined somewhere, for
example in the font's encoding or in its ToUnicode CMap. If that mapping
is missing or incorrect, you won't get a good extraction.
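
As a rough illustration, here is a toy model of that difference. It is
not PDFBox code: ToyFont, demoFont and their methods are made-up names.
The glyph table is an assumption inferred from your content stream and
the expected text (in Ty7 the codes $, % and & seem to paint "3", "5"
and "0", and ' seems to paint the "ff" ligature). Falling back to the
raw code when no Unicode mapping exists is also a simplification.

```java
import java.util.HashMap;
import java.util.Map;

public class GlyphVsCharacter {

    /** Toy stand-in for a PDF font (hypothetical, not a PDFBox class). */
    public static final class ToyFont {
        // code -> what gets painted on the page (the glyph)
        private final Map<Character, String> glyphs = new HashMap<>();
        // code -> what text extraction should return (the character)
        private final Map<Character, String> toUnicode = new HashMap<>();

        public ToyFont paints(char code, String shape) {
            glyphs.put(code, shape);
            return this;
        }

        public ToyFont mapsTo(char code, String text) {
            toUnicode.put(code, text);
            return this;
        }

        /** What a viewer shows: only the glyph table matters. */
        public String render(String codes) {
            StringBuilder sb = new StringBuilder();
            for (char c : codes.toCharArray()) {
                sb.append(glyphs.getOrDefault(c, "?"));
            }
            return sb.toString();
        }

        /** What extraction returns: only the Unicode mapping matters. */
        public String extract(String codes) {
            StringBuilder sb = new StringBuilder();
            for (char c : codes.toCharArray()) {
                // Assumed fallback: with no mapping, return the raw code.
                sb.append(toUnicode.getOrDefault(c, String.valueOf(c)));
            }
            return sb.toString();
        }
    }

    /** Builds the assumed Ty7 glyph table, with or without Unicode mappings. */
    public static ToyFont demoFont(boolean withToUnicode) {
        ToyFont font = new ToyFont()
                .paints('$', "3").paints('%', "5")
                .paints('&', "0").paints('\'', "ff");
        if (withToUnicode) {
            font.mapsTo('$', "3").mapsTo('%', "5")
                .mapsTo('&', "0").mapsTo('\'', "ff");
        }
        return font;
    }

    public static void main(String[] args) {
        ToyFont broken = demoFont(false);
        ToyFont fixed = demoFont(true);

        System.out.println(broken.render("$%&"));   // 350
        System.out.println(broken.extract("$%&"));  // $%&
        System.out.println(fixed.extract("$%&"));   // 350
        System.out.println("su" + fixed.extract("'") + "ering"); // suffering
    }
}
```

Both fonts render identically, which is why the page looks fine in a
viewer. Only the one with the Unicode mapping extracts correctly.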

Tilman

>
> I've tried to make sense of the PDF using the debugger, but this is a bit beyond my (limited)
> PDF internals knowledge.  My guess is that the PDF file has some problems with the subsetted
> "BerlingskeSerifText-Extralight*2" font (this appears to be the font used in the example I
> provide below), but I can't determine why the problem glyphs appear fine inside a PDF viewer
> whereas the extracted text is incorrect.
>
> Thanks for any guidance you can provide!  I've included a sample file and details below.
>
> John
>
> I've uploaded the PDF for a problem page here:
>
> https://www.dropbox.com/s/05rlbmv74ya0lrg/TVL_2016_12-64.pdf?dl=0
>
> The phrase "comfortable Airbus A XWB to Helsinki and suffering zero jet lag" on this
> page has problems with the numbers in "A350" and the ligature in "suffering".
>
> If I use the PDFbox preflight app, I see three error classes:
>
> 1.0.14 : Syntax error, Object {67:0} has an offset of 0
> 3.1.4 : Invalid Font definition, UDWCAS+BerlingskeSerifCn-XBd: The Charset entry is missing for the Type1 Subset
> 1.2.7 : Body Syntax error, Filter specified in metadata dictionnary
>
> The PDF debugger dump of this part of the content is:
>
> q
>      1 0 0 1 99.60001 123.131 cm
>      BT
>        8.5 0 0 8.5 0 0 Tm
>        /Ty5 1 Tf
>        [ (c) 10 (omfort) -9.9 (able ) -24 (Airb) 5.1 (us ) -24 (A) ] TJ
>      ET
>    Q
>    q
>      1 0 0 1 99.60001 123.131 cm
>      BT
>        8.5 0 0 8.5 81.1988 0 Tm
>        /Ty7 1 Tf
>        [ ($%) 10 (&) ] TJ
>      ET
>    Q
>    q
>      1 0 0 1 99.60001 123.131 cm
>      BT
>        8.5 0 0 8.5 94.5778 0 Tm
>        /Ty5 1 Tf
>        [ ( ) -24 (XWB ) -24 ( ) -24 (to ) -24 (Helsinki ) -24 (and ) -24 (su) ] TJ
>      ET
>    Q
>    q
>      1 0 0 1 99.60001 123.131 cm
>      BT
>        8.5 0 0 8.5 186.9813 0 Tm
>        /Ty7 1 Tf
>        (') Tj
>      ET
>    Q
>    q
>      1 0 0 1 99.60001 123.131 cm
>      BT
>        8.5 0 0 8.5 192.0218 0 Tm
>        /Ty5 1 Tf
>        [ (ering ) -24 (z) 5 (er) 10 (o ) -24 (jet ) -24 (lag, ) -24 (t) -5 (ra) 10 (v)
>          10 (el ) -24 (is ) -24 (g) 5 (ett) -5 (ing ) -24 (undeniably ) -24 (better) 20 (. ) ] TJ
>      ET
>    Q
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>

