pdfbox-dev mailing list archives

From "Yigal Dayan (JIRA)" <j...@apache.org>
Subject [jira] Updated: (PDFBOX-684) Incorrect ordering of compound Arabic glyphs
Date Thu, 08 Apr 2010 08:09:38 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yigal Dayan updated PDFBOX-684:
-------------------------------

    Attachment: zzz.after_fix.txt
                zzz.before_fix.txt
                zzz.pdf

Attaching a sample PDF and two UTF-8 outputs (before and after the fix).

> Incorrect ordering of compound Arabic glyphs
> --------------------------------------------
>
>                 Key: PDFBOX-684
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-684
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.0.0, 1.1.0
>            Reporter: Yigal Dayan
>            Priority: Minor
>         Attachments: zzz.after_fix.txt, zzz.before_fix.txt, zzz.pdf
>
>   Original Estimate: 3h
>  Remaining Estimate: 3h
>
> Some Arabic PDFs contain compound glyphs for stylistic reasons.
> Each such glyph encodes two letters: FI, SI, LI, LJ, LM, etc.
> Before a line is sent to the bidirectional (bidi) algorithm, all characters have been
> sorted into visual order, except for these pairs. Each pair is handled as one unit and
> therefore keeps its original (logical) order. The bidi algorithm straightens out most
> characters but reverses the glyph pairs.
> To fix this, the output of font.encode() should be examined and reversed on the spot
> if it contains a pair of Arabic characters. This may require adding a stub method to
> PDFStreamEngine (in method processEncodedText) that PDFTextStripper can override
> (in sort mode only).
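
A minimal sketch of the "reverse on the spot" check described above, in plain Java with no PDFBox dependency. The class and method names (ArabicPairFix, reverseIfArabicCompound) are hypothetical and not part of PDFBox. The sketch assumes the decoded text of a single glyph is available as a String and that "pair of Arabic characters" means two or more code points that all belong to the Arabic script. How this would be wired into processEncodedText or a PDFTextStripper override is not shown.

```java
// Hypothetical helper, not PDFBox API: pre-reverses the decoded text of a
// compound Arabic glyph so that a later bidi pass restores the intended order.
final class ArabicPairFix {

    // True if the code point is assigned to the Arabic script.
    static boolean isArabic(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.ARABIC;
    }

    // Returns the input reversed if it holds two or more code points that are
    // all Arabic; otherwise returns the input unchanged.
    static String reverseIfArabicCompound(String decoded) {
        if (decoded == null || decoded.codePointCount(0, decoded.length()) < 2) {
            return decoded; // single characters (or nothing) need no fix
        }
        boolean allArabic = decoded.codePoints().allMatch(ArabicPairFix::isArabic);
        return allArabic ? new StringBuilder(decoded).reverse().toString() : decoded;
    }
}
```

Under these assumptions, a glyph decoded as lam followed by meem would be emitted as meem followed by lam, while single letters, Latin text, and mixed-script strings pass through untouched. Restricting the check to sort mode, as the report suggests, would be the caller's responsibility.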

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

