pdfbox-users mailing list archives

From Maruan Sahyoun <sahy...@fileaffairs.de>
Subject Re: Problems getting the height of text in v2?
Date Mon, 26 Oct 2015 07:33:26 GMT

> On 26.10.2015 at 06:36, Joel Hirsh <joelehirsh@gmail.com> wrote:
> I am trying to get the size of text (i.e fontsize).  In version 1.8, the
> height of text was somewhat inconsistent, and not there for type 3 fonts,
> but I thought that was supposed to be all sorted out in v2.0.  But version
> 2 seems to be even more inconsistent than version 1.8.
> I am using PDFTextStripper and reading the TextPosition array that comes
> with each String.  I have tried getHeight(), getFontSize(),
> getFontSizeInPt(), getYScale(), and none of them are dependable for a useful
> answer.  They are consistent within a file, but useless for checking if a
> particular string contains readable size text.

Maybe take a look at PrintTextLocations.java in the examples package. It should let you
compare the output of the 1.8.x version with that of the 2.0.0 version.
> Which one of these TextPosition values should be used for this purpose?
> And then do I report bugs on all the files that don't give correct results?

If there are differences between 1.8.x and 2.0.0, then yes, please open an issue at https://issues.apache.org/jira/browse/PDFBOX/.

Please check whether there are already similar issues that you could add to. We are currently working
together with Apache Tika to look at potential regressions in 2.0.0 compared to 1.8.x, and
some issues have already been created and fixed; you can follow them at https://issues.apache.org/jira/browse/PDFBOX-3058.


> FYI - I ran a test with version 2 against 100+ PDF files that come from
> different sources, and use a mixture of TrueType, Type 0, Type1, Type3
> fonts.  All of these have text that is font size 8-12pt, as reported by
> Acrobat.  I dumped the size values returned for digit strings in the files
> (i.e 12345), so that everything should be a full height string.
> The reported height of text mostly ranged from 2.3 to 7.5 (although one
> very readable file reported a height of 0).  I examined a few files with
> Acrobat and the files with reported text height of 2.3  and 7.5 both had
> 9pt fonts.  But the other values from TextPosition were worse. The fontsize
> was a plausible value for only about half of these files, seemed
> particularly bad on TrueTypeFont's.  The fontsize values ranged from 1 to
> 200.  The fontsizeinpt values seemed mostly to be a multiple of fontsize,
> but even that was inconsistent, often it seems to be the square of the
> fontsize (like a fontsize of 58 and a fontsizeinpt of 3364), but sometimes
> simply a multiple of 10.
> The most accurate value I could find in the TextPosition was getYScale(),
> which had a plausible value about 90% of the time.  But on type3 fonts, it
> too was inconsistent, often returning values of 1, but also values up to 27.
> So how should I be finding out the height of text??
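
A note on why a raw font size can differ from the size a viewer reports: in the PDF text model, the font size operand of the Tf operator is scaled by the text matrix and by the current transformation matrix before glyphs are drawn, so a file can set a font size of 1 and still render 9pt text. Whether this explains the values in the report above is an assumption; nothing in this thread confirms it. The sketch below is plain Java, not PDFBox code. The class and method names are hypothetical, and it assumes 100% horizontal scaling and zero text rise.

```java
// Hypothetical illustration of the PDF text rendering matrix; not PDFBox API.
class EffectiveFontSize {

    /**
     * Multiplies two PDF matrices given as {a, b, c, d, e, f}.
     * PDF uses the row-vector convention, so the result applies m1 first, then m2.
     */
    static double[] multiply(double[] m1, double[] m2) {
        return new double[] {
            m1[0] * m2[0] + m1[1] * m2[2],
            m1[0] * m2[1] + m1[1] * m2[3],
            m1[2] * m2[0] + m1[3] * m2[2],
            m1[2] * m2[1] + m1[3] * m2[3],
            m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
            m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
        };
    }

    /**
     * Vertical size of text in device space for a given Tf font size,
     * text matrix and current transformation matrix (CTM).
     * Assumes horizontal scaling of 100% and a text rise of 0.
     */
    static double effectiveSize(double fontSize, double[] textMatrix, double[] ctm) {
        double[] params = {fontSize, 0, 0, fontSize, 0, 0};
        double[] trm = multiply(multiply(params, textMatrix), ctm);
        // The unit vector (0, 1) maps to (c, d); its length is the vertical scale.
        return Math.hypot(trm[2], trm[3]);
    }

    public static void main(String[] args) {
        double[] identity = {1, 0, 0, 1, 0, 0};
        // Font size 1, scaled to 9 by the text matrix.
        System.out.println(effectiveSize(1, new double[] {9, 0, 0, 9, 72, 700}, identity));
        // Font size 12, scaled down to 9 by the CTM.
        System.out.println(effectiveSize(12, identity, new double[] {0.75, 0, 0, 0.75, 0, 0}));
    }
}
```

Both calls in main describe text of the same rendered size even though the raw font sizes are 1 and 12, which is why a combined value is a better candidate for a readability check than the font size alone.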

