pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-1691) "Foreign" characters are not rendered
Date Sat, 04 Jan 2014 11:22:50 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-1691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13747702#comment-13747702 ]

Tilman Hausherr edited comment on PDFBOX-1691 at 1/4/14 11:22 AM:
------------------------------------------------------------------

I did some research into this. Short version: the font uses a deprecated feature. However,
one cannot retroactively "deprecate" a font in an existing PDF file, unless time travel works.

Long version: the problem is that the glyphs have empty paths, so I tried to find out why.
The paths are created in handleCommandType2() in CharStringRenderer.java.

I looked into the document "The Type 2 Charstring Format".

This is the byte sequence for the "Scaron" character of the DIN BOLD font in the "bimbo" file (Bimbo_Historia_20070409_Esp.pdf):

A9 85 9F 01 B7 F7 45 DE F7 63 0E

The first byte (A9) is the width argument, 85 9F 01 is HSTEM with its two operands, and
B7 F7 45 DE F7 63 0E (see page 13) means (decimal) 44 177 83 207 ENDCHAR.
These are the same numbers that are in charStringDict in the fonts variable.
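
The decoding of the byte sequence above can be checked with a small sketch. This is not PDFBox code: the class and method names are made up for illustration, and it assumes the integer encodings of Type 2 charstrings (one byte for 32..246, two bytes for prefixes 247..254, three bytes for prefix 28). The five-byte fixed-point form (prefix 255) is left out, and the input is assumed to be well-formed.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative decoder for Type 2 charstring operands (hypothetical helper, not PDFBox code). */
final class Type2NumberDecoder
{
    /**
     * Decodes operands starting at index start and stops at the first byte
     * that is not an integer operand (an operator, or the unsupported prefix 255).
     */
    static List<Integer> decodeOperands(int[] bytes, int start)
    {
        List<Integer> operands = new ArrayList<Integer>();
        int i = start;
        while (i < bytes.length)
        {
            int b0 = bytes[i];
            if (b0 >= 32 && b0 <= 246)
            {
                // one-byte form: -107..107
                operands.add(b0 - 139);
                i += 1;
            }
            else if (b0 >= 247 && b0 <= 250)
            {
                // two-byte positive form: 108..1131
                operands.add((b0 - 247) * 256 + bytes[i + 1] + 108);
                i += 2;
            }
            else if (b0 >= 251 && b0 <= 254)
            {
                // two-byte negative form: -1131..-108
                operands.add(-(b0 - 251) * 256 - bytes[i + 1] - 108);
                i += 2;
            }
            else if (b0 == 28)
            {
                // three-byte form: signed 16-bit integer
                operands.add((int) (short) ((bytes[i + 1] << 8) | bytes[i + 2]));
                i += 3;
            }
            else
            {
                break; // operator byte or encoding not handled in this sketch
            }
        }
        return operands;
    }

    public static void main(String[] args)
    {
        int[] scaron = {0xA9, 0x85, 0x9F, 0x01, 0xB7, 0xF7, 0x45, 0xDE, 0xF7, 0x63, 0x0E};
        System.out.println(decodeOperands(scaron, 0)); // operands before HSTEM (0x01)
        System.out.println(decodeOperands(scaron, 4)); // operands before ENDCHAR (0x0E)
    }
}
```

Starting at index 4 (the byte after the HSTEM operator) yields the four ENDCHAR arguments discussed here.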

ENDCHAR is described on page 21. Look at Note 8:

endchar also has a deprecated function; see Appendix C, “Comaptibility [sic] and Deprecated
Operators.”

Page 35:
– adx ady bchar achar endchar (14) |–

endchar may have four extra arguments that correspond exactly to the last four
arguments of the Type 1 charstring command “seac” (see Type 1 Font Format book).

So I looked into the Type 1 Font Format book on page 35:

seac = "standard encoding accented character", 
makes an accented character from two other characters in its font program.

The origin of the accent is placed at (adx,ady) relative to the origin of
the base character. The bchar argument is the character code of the base character, 
and the achar argument is the character code of the accent character. Both
bchar and achar are codes that these characters are assigned in the Adobe StandardEncoding
vector, given in an Appendix in the PostScript Language Reference Manual. Furthermore, the
characters represented by achar and bchar must be in the same positions in the font’s encoding
vector as the positions they occupy in the Adobe StandardEncoding vector.

So here:
adx = 44
ady = 177
bchar = 83, octal 0123, this is an "S", see page 794 in the PostScript Language Reference Manual
achar = 207, octal 0317, yes this is the "caron"
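
For anyone who wants to double-check the octal values, here is a tiny sketch. The lookup table is only an excerpt containing the two StandardEncoding entries mentioned in this comment; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative lookup of seac character codes (hypothetical helper, not PDFBox code). */
final class SeacArgs
{
    // Excerpt only: the two StandardEncoding entries cited in this comment.
    private static final Map<Integer, String> STANDARD_ENCODING_EXCERPT = new HashMap<Integer, String>();
    static
    {
        STANDARD_ENCODING_EXCERPT.put(0123, "S");     // octal literal, decimal 83
        STANDARD_ENCODING_EXCERPT.put(0317, "caron"); // octal literal, decimal 207
    }

    /** Formats a character code as decimal, octal and (if known) glyph name. */
    static String describe(int code)
    {
        String name = STANDARD_ENCODING_EXCERPT.get(code);
        return code + " = octal 0" + Integer.toOctalString(code)
                + " -> " + (name == null ? "(not in excerpt)" : name);
    }

    public static void main(String[] args)
    {
        System.out.println(describe(83));  // bchar
        System.out.println(describe(207)); // achar
    }
}
```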


The solution would probably be to replace this glyph's entry in charStringDict with the combined data of the two characters (base and accent).

Tricky.

(see also PDFBOX-1501)
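
To illustrate the idea of combining the two characters, here is a minimal sketch that works on already-rendered outlines instead of on the charstring data. It is not how PDFBox does it: the class and method are hypothetical, and it assumes the paths of the base and accent glyphs are available as GeneralPath objects in the same glyph coordinate space.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.GeneralPath;

/** Illustrative seac-style composition of two glyph outlines (hypothetical helper). */
final class SeacComposer
{
    /**
     * Returns a new path containing the base outline plus the accent outline,
     * with the accent origin placed at (adx, ady) relative to the base origin.
     * The input paths are not modified.
     */
    static GeneralPath compose(GeneralPath base, GeneralPath accent, double adx, double ady)
    {
        GeneralPath result = new GeneralPath(base);
        AffineTransform shift = AffineTransform.getTranslateInstance(adx, ady);
        // false: keep the accent as separate subpaths, do not connect it to the base
        result.append(accent.getPathIterator(shift), false);
        return result;
    }
}
```

For the "Scaron" above this would be called with the paths of "S" and "caron" and adx = 44, ady = 177.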

As a temporary workaround, I use DrawString (which works but doesn't look as good) instead
of drawGlyph2D() when the paths are empty.

In processTextPosition():

{code}
(...)
            else
            {
                Glyph2D glyph2D = createGlyph2D(font);
                // TH: fall back to plan B if any glyph of this text has no outline
                if (glyph2D != null && !checkDrawGlyph2D(glyph2D, text))
                {
                    LOG.warn("font '" + font.getBaseFont() + "': text '" + text
                            + "' rendered with plan B because path is empty");
                    glyph2D = null;
                }
                if (glyph2D != null)
(...)

    /**
     * Returns false if any character of the text (other than a space)
     * has a missing or empty glyph path.
     */
    private boolean checkDrawGlyph2D(Glyph2D glyph2D, TextPosition text)
    {
        int[] codePoints = text.getCodePoints();
        for (int i = 0; i < codePoints.length; i++)
        {
            GeneralPath path = glyph2D.getPathForCharactercode(codePoints[i]);
            // a space has an empty path anyway, so it does not count as a failure
            if (path == null
                    || (path.getPathIterator(null).isDone() && text.getCharacter().charAt(i) != ' '))
            {
                return false;
            }
        }
        return true;
    }
{code}
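
The empty-path test used in checkDrawGlyph2D() can be tried on its own, without PDFBox. A minimal sketch with a hypothetical helper class; it only relies on the fact that the PathIterator of a GeneralPath without segments is done immediately.

```java
import java.awt.geom.GeneralPath;

/** Illustrative stand-alone version of the empty-path test (hypothetical helper). */
final class EmptyPathCheck
{
    /** Returns true if the path is missing or contains no segments at all. */
    static boolean isEmpty(GeneralPath path)
    {
        return path == null || path.getPathIterator(null).isDone();
    }

    public static void main(String[] args)
    {
        GeneralPath empty = new GeneralPath();
        GeneralPath line = new GeneralPath();
        line.moveTo(0, 0);
        line.lineTo(10, 10);
        System.out.println(isEmpty(empty)); // no segments
        System.out.println(isEmpty(line));  // has a moveTo and a lineTo
    }
}
```

Note that a path consisting of a single moveTo already counts as non-empty with this test, since the iterator then has one segment.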



> "Foreign" characters are not rendered
> -------------------------------------
>
>                 Key: PDFBOX-1691
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1691
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: XP, 7, Java 7.25
>            Reporter: Tilman Hausherr
>         Attachments: 1822-AGB-03-good.png, 1822-AGB-03.png, 1822-AGB.pdf, Bimbo_Historia_20070409_Esp-02-good.png, Bimbo_Historia_20070409_Esp.pdf, PDFBOX-1691-FORIS-HV.pdf, PDFBOX-1691-FORIS-HV.pdf-2.png
>
>
> In the attached file (from page 3 of the pdf file), the letters ä, ö and ü are not rendered.
> I am using the version of last weekend.



