pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Problems getting the height of text in v2?
Date Wed, 28 Oct 2015 00:42:18 GMT

> On 25 Oct 2015, at 22:36, Joel Hirsh <joelehirsh@gmail.com> wrote:
> I am trying to get the size of text (i.e fontsize).  In version 1.8, the
> height of text was somewhat inconsistent, and not there for type 3 fonts,
> but I thought that was supposed to be all sorted out in v2.0.  But version
> 2 seems to be even more inconsistent than version 1.8.
> I am using PDFTextStripper and reading the TextPosition array that comes
> with each String.  I have tried getHeight(), getFontSize(),
> getFontSizeInPt(), getYScale, and none of them are dependable for a useful
> answer.  They are consistent within a file, but useless for checking if a
> particular string contains readable size text.
> Which one of these TextPosition values should be used for this purpose
> And then do I report bugs on all the files that don't give correct results?
> FYI - I ran a test with version 2 against 100+ PDF files that come from
> different sources, and use a mixture of TrueType, Type 0, Type1, Type3
> fonts.  All of these have text that is font size 8-12pt, as reported by
> Acrobat.  I dumped the size values returned for digit strings in the files
> (i.e 12345), so that everything should be a full height string.
> The reported height of text mostly ranged from 2.3 to 7.5 (although one
> very readable file reported a height of 0).  I examined a few files with
> Acrobat and the files with reported text height of 2.3  and 7.5 both had
> 9pt fonts.  But the other values from TextPosition were worse. The fontsize
> was a plausible value for only about half of these files, seemed
> particularly bad on TrueTypeFont's.  The fontsize values ranged from 1 to
> 200.  The fontsizeinpt values seemed mostly to be a multiple of fontsize,
> but even that was inconsistent, often it seems to be the square of the
> fontsize (like a fontsize of 58 and a fontsizeinpt of 3364), but sometimes
> simply a multiple of 10.
> The most accurate value I could find in the TextPosition was getYScale(),
> which had a plausible value about 90% of the time.  But on type3 fonts, it
> too was inconsistent, often returning values of 1, but also values up to 27.
> So how should I be finding out the height of text??

You’re right that these methods are inconsistent. You might expect PDFBox
to return the dimensions of a given glyph or string’s bounding box from
those methods, but that’s not the case. What getWidth() actually returns is
the *logical* width of the glyph, i.e. its advance width, not its visual
width. That’s pretty normal and is fine for most use cases. What’s not
normal is that there’s a getHeight() method at all: there’s no such thing
as a per-glyph logical height, because the logical height is always equal
to the font size, regardless of the glyph.
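
To make the logical-versus-visual distinction concrete, here is a minimal sketch of the advance-width formula from the PDF specification's text-positioning rules, next to the logical height. This is not PDFBox code: the class and method names are hypothetical, the glyph width is assumed to be in the usual 1/1000 em units, and TJ position adjustments are left out for brevity.

```java
// Hypothetical helper, not part of PDFBox: logical glyph metrics in text space.
final class LogicalGlyphMetrics {
    private LogicalGlyphMetrics() {}

    /**
     * Horizontal advance of one glyph in text space:
     *   tx = (w0 / 1000 * Tfs + Tc + Tw) * Th
     * wordSpacing should be 0 unless the glyph is a single-byte space (code 32).
     */
    static double advanceWidth(double glyphWidth, double fontSize,
                               double charSpacing, double wordSpacing,
                               double horizontalScaling) {
        return (glyphWidth / 1000.0 * fontSize + charSpacing + wordSpacing)
                * horizontalScaling;
    }

    /** The logical height does not depend on the glyph at all. */
    static double logicalHeight(double fontSize) {
        return fontSize;
    }

    public static void main(String[] args) {
        // Illustrative widths only: a narrow and a wide glyph at 12 units.
        System.out.println("narrow advance: " + advanceWidth(250, 12, 0, 0, 1.0));
        System.out.println("wide advance:   " + advanceWidth(750, 12, 0, 0, 1.0));
        System.out.println("logical height: " + logicalHeight(12));
    }
}
```

The advance changes with the glyph while the logical height stays fixed, which is why a per-glyph height method has nothing meaningful to report in the logical sense.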

So what does PDFont.getHeight() do? Well, it’s not pretty: sometimes it
returns the visual height of the glyph, other times it returns the y-advance
(even though that’s zero unless it’s a vertical font). Sometimes it returns
values in text space, other times in glyph space. We should probably just
remove this method as it really serves no purpose, but somewhere in the
2000-odd lines of PDFTextStripper are some assumptions which depend on it,
and I for one have no intention of entering that labyrinth.

Actually, it gets worse: PDFTextStripper depends on several incorrect
calculations of the text rendering matrix and other values, and fixing
those calculations caused PDFTextStripper to break. As a workaround,
PDFTextStreamEngine was created, which overrides showGlyph and replaces the
perfect calculations of PDFStreamEngine with the incorrect calculations on
which PDFTextStripper depends. There are some fun assumptions in there, such
as using 1/2 the font’s (yes font, not glyph) bounding box as the current
glyph’s height, which is quite meaningless.
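
A small sketch of why half the font bounding box is a poor stand-in for glyph height. It only illustrates the idea described above and is not the PDFBox implementation; the class name is hypothetical and the bounding-box figures are made up for illustration, assuming 1000 glyph units per em.

```java
// Hypothetical illustration, not the PDFBox implementation.
final class HalfFontBBoxHeight {
    private HalfFontBBoxHeight() {}

    /**
     * Height guess from the heuristic: half the font-wide bounding box,
     * scaled to the font size. No glyph is involved in the calculation.
     */
    static double heuristicHeight(double fontBBoxLowerY, double fontBBoxUpperY,
                                  double fontSize) {
        double bboxHeight = fontBBoxUpperY - fontBBoxLowerY; // glyph units
        return bboxHeight / 2.0 / 1000.0 * fontSize;
    }

    public static void main(String[] args) {
        // Two made-up fonts, both set at 9pt, with very different font bounding boxes.
        double generousBBox = heuristicHeight(-250, 1250, 9);
        double tightBBox = heuristicHeight(0, 500, 9);
        System.out.println("font with generous bbox: " + generousBBox);
        System.out.println("font with tight bbox:    " + tightBBox);
    }
}
```

Both fonts are the same size on the page, yet the heuristic reports different heights, and within one font every glyph gets the same value. That is consistent with results that agree inside a file but cannot be compared across files.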

Those interested in fixing PDFTextStripper may want to start by removing
the legacy calculations from PDFTextStreamEngine and removing the
PDFont.getHeight() method entirely. They may then want to consider
whether to use visual bounds or logical bounds when computing glyph
properties. (Logical is simpler, faster, and probably fine.) I wish
those people good luck!
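
For anyone going the logical-bounds route, the logical height is the font size, but that size lives in text space and still has to be scaled by the text matrix and the current transformation matrix before it means anything on the page. The following is a rough sketch under that assumption, not PDFBox code: the class name is hypothetical, and matrices are passed as the four linear components {a, b, c, d} of a PDF matrix [a b c d e f].

```java
// Hypothetical sketch, not part of PDFBox: effective on-page font size.
final class EffectiveFontSize {
    private EffectiveFontSize() {}

    /** Multiplies the 2x2 linear parts of two PDF matrices (row-vector convention). */
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2], m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2], m[2] * n[1] + m[3] * n[3]
        };
    }

    /**
     * Font size after applying the text matrix and the CTM. The unit vector
     * (0, 1) maps to (c, d), so its length is the vertical scale factor.
     */
    static double effectiveSize(double fontSize, double[] textMatrix, double[] ctm) {
        double[] m = multiply(textMatrix, ctm);
        return fontSize * Math.hypot(m[2], m[3]);
    }

    public static void main(String[] args) {
        double[] identity = {1, 0, 0, 1};
        // A nominal size of 1 with the real scale carried by the text matrix.
        System.out.println(effectiveSize(1, new double[] {9, 0, 0, 9}, identity));
        // Rotating the page by 90 degrees should not change the size.
        System.out.println(effectiveSize(12, identity, new double[] {0, 1, -1, 0}));
    }
}
```

Using the length of the transformed vertical unit vector, rather than the d component alone, keeps the result stable for rotated text.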

The good news is that PDFont.getWidth() and PDFStreamEngine perform
their calculations correctly. Hence we get correct text rendering, even if
text extraction is incorrect. So the problems are contained solely in 
PDFTextStripper, PDFTextStreamEngine, and TextPosition.

— John
