pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Character widths in fonts
Date Thu, 20 Nov 2014 20:35:28 GMT
Hi Peter,

a) Paragraphs 1 and 2 use the same font, the Type 0 / CIDType2 font
     “ABCDEE+Calibri” but paragraph 3 uses a TrueType font with the same
     name. Both fonts have valid widths specified in their FontDescriptors.

b) Yes, I think that it causes issues with printing, but I’ve encountered such
     PDFs in the past. The name of a font is never actually used anywhere
     inside the PDF document, so it doesn’t matter from a practical point of
     view - a font is always identified by a page-unique name in the current
     page’s resource dictionary, not via its PostScript name.

c) It tries to find the width of character 32, and if it is not available it uses
    the average width of all characters - it’s not ideal by any means.

d) Have you tried our PDFDebugger app? Personally I use Acrobat Pro for
    figuring out thorny issues, but I know that’s not an option for everyone.
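
To make the fallback in (c) concrete, here is a minimal sketch in plain Java.
It is not the PDFTextStripper source: the class name, the method name and the
map of character codes to widths are all hypothetical, and returning 0 for a
font with no widths at all is an assumption for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of PDFBox: illustrates the heuristic
// "use the width of character 32, else the average of all widths".
final class SpaceWidthSketch {

    /** Character code of the ASCII space. */
    static final int SPACE_CODE = 32;

    /**
     * Estimates the width of a space for a font.
     *
     * @param widths character code -> glyph width (assumed input shape)
     * @return the width stored for code 32 if present, otherwise the
     *         arithmetic mean of all stored widths, or 0 if there are none
     */
    static float estimateSpaceWidth(Map<Integer, Float> widths) {
        Float space = widths.get(SPACE_CODE);
        if (space != null) {
            return space;
        }
        if (widths.isEmpty()) {
            return 0f; // assumption: nothing to average over
        }
        float sum = 0f;
        for (float w : widths.values()) {
            sum += w;
        }
        return sum / widths.size();
    }
}
```

With widths for codes 65 and 66 of 600 and 400 and no entry for code 32, the
sketch falls back to the mean of the two. It also shows why a font whose
widths are nearly all 1000 would push the estimate towards 1000 whenever the
space glyph itself is missing.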


-- John

> On 20 Nov 2014, at 05:18, Peter Murray-Rust <pm286@cam.ac.uk> wrote:
> I have built PDF2SVG (https://bitbucket.org/petermr/pdf2svg/wiki/Home) on
> top of PDFBox, and I use
> org.apache.pdfbox.pdmodel.font.PDFont.getFontWidth( byte[] c, int offset,
> int length )
> to get the width of characters, which I then use to calculate spacing and
> thereby words. I have processed many documents successfully, but have found
> one that causes problems:
> apps.who.int/iris/bitstream/10665/143216/1/roadmapsitrep_14Nov2014_eng.pdf
> Paragraphs 1 ("A total ...") and 2 ("Following...")  have a font
> "ABCDEE+Calibri" where nearly all character widths are 1000 (which is
> clearly wrong). Para 3 ("In Mali"...) apparently has the same font name
> ("ABCDEE+Calibri") but has different spacings for each character,  which
> then give a proper layout.
> I don't know whether the document or my code is wrong and I'd be grateful
> for a very quick reality-check. I've run it through PDFTextStripper and it
> behaves properly. [I don't use PDFTextStripper because I want to preserve
> individual characters, including styles, weights and sub/superscripts.]
> (a) Do paragraphs 1 and 2 use the same font as 3? What are the font names?
> (b) Is it allowed to have the same name for 2 different fonts (it would be
> very bad...)
> (c) How does PDFTextStripper calculate spaces? From the Font, or by some
> other heuristics?
> (d) is there a debugging tool on PDFBox I could be using for this sort of
> problem?
> Many thanks.
> P.
> -- 
> Peter Murray-Rust
> Reader in Molecular Informatics
> Unilever Centre, Dep. Of Chemistry
> University of Cambridge
> CB2 1EW, UK
> +44-1223-763069
