pdfbox-users mailing list archives

From John Hewson <j...@jahewson.com>
Subject Re: Custom glyph maps for a fonts
Date Wed, 29 Apr 2015 05:32:11 GMT
Hi Zeev,

> On 28 Apr 2015, at 12:50, Zeev Sands <zeev.sands@gmail.com> wrote:
> Hello everyone,
> I've been using PDFBox 2.0 for a couple of weeks and came across an issue with some
> symbol fonts (WPIconicSymbolsA and WPTypographicSymbols):
> I needed to convert the symbols to their Unicode equivalents, so I cooked up a small
> class to do that. No problems there.

It’s not clear exactly what you’re trying to do. Are you talking about extracting text
from a PDF? I’m going to assume that you are.

> My issue is - some of the symbols coming in are already being converted and some are
> not. I do see that there is a list of glyphs that is being loaded to do just that (glyphlist.txt)
> and there is an additional list (additional.txt) for more glyphs. What I don't understand
> is how a glyph can be mapped without specifying a font name, for example in WPIconicSymbolsA
> dec 33 is an outline of a heart, in WPTypographicSymbols dec 33 is a large filled dot.

PDF allows any “simple” font to have a PostScript Type 1 encoding overlaid onto it, so
even though the font may be a TTF, there’s another layer of encoding. In some cases the
original font’s encoding is stripped, so the PostScript encoding is the only encoding. In
other cases the PostScript encoding is empty and the TTF’s built-in encoding takes over.

Type 1 fonts pre-date Unicode. In a Type 1 font each glyph has a name, which is a string,
such as “Euro”. An encoding is a map of 8-bit codes to names, for example WinAnsiEncoding
is the Type 1 version of the familiar Windows-1252 encoding. So we’d have 128 => “Euro”,
in that case.

Later on, when Unicode was created, Adobe provided the glyphlist.txt to map from the standard
glyph names to Unicode code points, e.g. “Euro” => U+20AC. Combined with a Type 1 encoding,
this lets us read a code in a PDF file and convert it to Unicode, e.g. 128 => “Euro”
=> U+20AC. This is a global mapping, so we don’t need one per font.
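
The two-step lookup described above can be sketched in a few lines of Java. This is only an illustration of the idea, not PDFBox's actual code: the class name, the map names and the tiny table fragments are made up for the example, and real encodings and glyph lists are far larger.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names and tables are assumptions, not PDFBox code.
public class GlyphChainSketch {
    // Fragment of a Type 1 style encoding: 8-bit code -> glyph name.
    private static final Map<Integer, String> WIN_ANSI_FRAGMENT = new HashMap<>();
    // Fragment of a glyph list: glyph name -> Unicode string.
    private static final Map<String, String> GLYPH_LIST_FRAGMENT = new HashMap<>();

    static {
        WIN_ANSI_FRAGMENT.put(65, "A");
        WIN_ANSI_FRAGMENT.put(128, "Euro");
        GLYPH_LIST_FRAGMENT.put("A", "A");
        GLYPH_LIST_FRAGMENT.put("Euro", "\u20AC");
    }

    /** Returns the Unicode string for a code, or null if either step has no entry. */
    public static String toUnicode(int code) {
        String name = WIN_ANSI_FRAGMENT.get(code);                   // step 1: code -> name
        return name == null ? null : GLYPH_LIST_FRAGMENT.get(name);  // step 2: name -> Unicode
    }

    public static void main(String[] args) {
        // 128 => "Euro" => U+20AC
        System.out.println(toUnicode(128));
    }
}
```

Only the first table belongs to a font; the second is keyed by name alone, which is why a single copy can serve every font.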

Some fonts use non-standard names for glyphs, usually because the glyph is unusual and no
standard name exists. PDF provides numerous mechanisms for such glyphs to be mapped to Unicode,
and one of these is to look up the name in the standard glyph list. PDFBox ships with an additional,
non-standard glyph list which covers some commonly encountered glyphs, such as those found
in TeX. This is a bit of a hack, but such fonts typically don’t use any of the other Unicode
mechanisms provided by PDF, so this is a last resort for mapping such glyphs to Unicode.
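
As a rough sketch of that last-resort step, assuming nothing more than two name tables consulted in order: the standard list is tried first and the additional list only when the name is missing there. The class name and the glyph name "myheartoutline" are hypothetical, chosen for illustration, and this is not how PDFBox structures its own lookup.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; not PDFBox's implementation.
public class GlyphFallbackSketch {
    private final Map<String, String> standard = new HashMap<>();
    private final Map<String, String> additional = new HashMap<>();

    public GlyphFallbackSketch() {
        // A standard glyph name.
        standard.put("Euro", "\u20AC");
        // Hypothetical non-standard name, made up for this example.
        additional.put("myheartoutline", "\u2661");
    }

    /** Tries the standard list first, then the additional list; null if neither knows the name. */
    public String toUnicode(String glyphName) {
        String unicode = standard.get(glyphName);
        if (unicode == null) {
            unicode = additional.get(glyphName);
        }
        return unicode;
    }

    public static void main(String[] args) {
        GlyphFallbackSketch lookup = new GlyphFallbackSketch();
        System.out.println(lookup.toUnicode("Euro"));
        System.out.println(lookup.toUnicode("myheartoutline"));
        System.out.println(lookup.toUnicode("nosuchglyph"));
    }
}
```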

> So to be specific, my questions are :
>  Is there any way to give pdf box a map *per font*?

A glyph’s name should uniquely identify that glyph, so this shouldn’t be necessary. Just
add the missing names to additional.txt.
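
To illustrate why one global list is enough, here is a small sketch. It assumes the additional list uses the same "name;HEX" line format as Adobe's glyphlist.txt, with "#" starting a comment line. The glyph names "wpheartoutline" and "wplargedot" are invented for the example; the real names inside your two fonts will differ, and they are what you would add.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; glyph names below are hypothetical.
public class PerFontEncodingSketch {
    /** Parses "name;HEX HEX..." lines; blank lines and '#' comment lines are skipped. */
    public static Map<String, String> parseGlyphList(String text) {
        Map<String, String> result = new HashMap<>();
        for (String rawLine : text.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            String[] parts = line.split(";");
            if (parts.length != 2) {
                continue; // skip lines that are not "name;codes"
            }
            StringBuilder unicode = new StringBuilder();
            // A name may map to several code points, separated by spaces.
            // Non-hex input here would throw NumberFormatException.
            for (String hex : parts[1].trim().split("\\s+")) {
                unicode.appendCodePoint(Integer.parseInt(hex, 16));
            }
            result.put(parts[0].trim(), unicode.toString());
        }
        return result;
    }

    /** Per-font step (code -> name) followed by the shared step (name -> Unicode). */
    public static String toUnicode(Map<Integer, String> fontEncoding,
                                   Map<String, String> glyphList, int code) {
        String name = fontEncoding.get(code);
        return name == null ? null : glyphList.get(name);
    }

    public static void main(String[] args) {
        String extraEntries = "# hypothetical entries\n"
                + "wpheartoutline;2661\n"
                + "wplargedot;25CF\n";
        Map<String, String> glyphList = parseGlyphList(extraEntries);

        // Each font carries its own encoding, so code 33 can name a different glyph in each.
        Map<Integer, String> iconicEncoding = new HashMap<>();
        iconicEncoding.put(33, "wpheartoutline");
        Map<Integer, String> typographicEncoding = new HashMap<>();
        typographicEncoding.put(33, "wplargedot");

        System.out.println(toUnicode(iconicEncoding, glyphList, 33));
        System.out.println(toUnicode(typographicEncoding, glyphList, 33));
    }
}
```

The same code 33 resolves to different characters because the per-font part lives in each font's encoding, while the name table stays shared.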

>  What is the philosophy of glyph conversion? How are different fonts converted to different
> Unicode characters?

Hopefully I’ve covered that above. The overall philosophy is to avoid hard-coding where
possible and infer Unicode from the PDF wherever possible.

> Please, let me know if I am looking at the whole thing incorrectly. Perhaps there is
> an easier way…

If you upload the PDF to a public URL then I can take a look at it and see exactly what the
issue is.

> Thank you,
> Zeev
