pdfbox-users mailing list archives

From "info@lehmi.de" <i...@lehmi.de>
Subject Re: Font size and text height in PDFBox 0.8.0
Date Fri, 20 Feb 2009 10:03:11 GMT
Hi Andreas

> I'm using PDFBox to extract text, bounding boxes, and font information
> from PDF files from a variety of sources. Mostly in files with Type 3
> fonts, but also in others, org.apache.pdfbox.util.TextPosition does not
> return the correct information. In those cases, getHeight returns 0
> and getFontSize returns 1 (the latter happens much more frequently).
> PDFBox 0.8.0 (from the svn trunk) addresses the issue for about one
> third of the documents that had problems in PDFBox 0.7.3.  Here is an
> example of a document that is especially bad. PDFont also does not
> have any base font information, maybe because of the Type 3 fonts.
The problem lies in the way some PDF generators produce their documents.
The PDF operator Tf sets the font size directly, and that is the value
you get from TextPosition.getFontSize(). In many cases, however, the
font size is set to the default size 1 and scaled to the real size
through the text matrix. PDFBox reads and uses both to draw the string
with the right scaling, so the rendered result is the same whether the
PDF document uses Tf = 12 and Tm = 1 or, the other way round,
Tf = 1 and Tm = 12.
I'll extend TextPosition so that it returns the size as a combination of
the font size and the scaling.
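
Roughly, the combination could look like the sketch below. This is only an
illustration of the idea, not the actual PDFBox change: the class and method
names are made up, the text matrix is assumed to be given as the six PDF
values [a b c d e f], and the current transformation matrix of the page is
ignored.

```java
/** Hypothetical helper, not part of PDFBox. */
public final class EffectiveFontSize {

    private EffectiveFontSize() {
    }

    /**
     * Vertical scaling of a text matrix given as [a b c d e f].
     * The unit vector (0, 1) is mapped to (c, d), so its length is the
     * factor by which glyph heights are scaled (rotation is handled too).
     */
    public static double verticalScale(double[] tm) {
        if (tm == null || tm.length != 6) {
            throw new IllegalArgumentException("expected [a b c d e f]");
        }
        return Math.hypot(tm[2], tm[3]);
    }

    /** Size given to Tf multiplied by the vertical scaling of Tm. */
    public static double effectiveSize(double tfSize, double[] tm) {
        return tfSize * verticalScale(tm);
    }

    public static void main(String[] args) {
        // Tf = 12 with an unscaled text matrix
        double viaTf = effectiveSize(12, new double[] {1, 0, 0, 1, 0, 0});
        // Tf = 1 with a text matrix that scales by 12
        double viaTm = effectiveSize(1, new double[] {12, 0, 0, 12, 0, 0});
        System.out.println(viaTf + " / " + viaTm);
    }
}
```

Both calls in main describe the same visible text size, which is why the
combined value is more useful than the raw Tf size alone.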

> P.S.: For a few documents, I ran into a different font related issue
> (see stack trace below). I added a print statement to determine the
> values that cause the problem.
Can you provide us an example for this issue?

Andreas Lehmkühler
