pdfbox-dev mailing list archives

From "Brian Carrier (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PDFBOX-377) Incorrect direction of extracted Arabic Text
Date Tue, 13 Jan 2009 19:20:20 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-377.

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator
         Assignee: Brian Carrier

Patch checked into trunk revision 734151.

> Incorrect direction of extracted Arabic Text
> --------------------------------------------
>                 Key: PDFBOX-377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Brian Carrier
>            Assignee: Brian Carrier
>             Fix For: 0.8.0-incubator
>         Attachments: hello3.pdf, PDFTextStripper.diff, reorder-patch.zip
> Arabic text (and text in other right-to-left languages) is stored in presentation (visual)
> order in PDF files, which is the opposite of the logical order in which Arabic text is
> typically stored. Arabic text is typically stored such that the first byte is for the
> right-most character, but the output of PDFBox always has the first byte being the
> left-most character.
> Further, PDF files typically store the presentation forms of Arabic characters instead of
> the more general forms. For example, U+FB50 instead of U+0671. Presentation forms are not
> supposed to be stored in logically ordered text, but PDFBox does not normalize them out.
> The attached patch solves both of these problems using the ICU4J library
> (http://www.icu-project.org/). It identifies the dominant text direction of each page and
> reverses the order of each line (only if any right-to-left text exists). It then
> normalizes the text to remove the presentation forms.
> An example file is attached.  Without the patch, the following is (incorrectly) produced:
> Hello ﺪﻤﺤﻣ World. 
> With the patch, the following is (correctly) produced:
> Hello محمد World. 
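The two steps described in the issue (reordering right-to-left text and removing presentation forms) can be illustrated with a small sketch. This is not the checked-in patch, which uses ICU4J. As an assumption made to keep the example self-contained, it uses only the JDK: java.text.Normalizer with NFKC to fold presentation forms back to base letters, and Character.getDirectionality to find right-to-left characters. The class name RtlReorderSketch and the method names isRtl and toLogicalOrder are hypothetical. Unlike the patch, it does not determine a dominant page direction; it only reverses each unbroken run of right-to-left characters, which is enough for the single-word example above.

```java
import java.text.Normalizer;

class RtlReorderSketch {

    // True for characters whose Unicode bidi class is strongly right-to-left
    // (Hebrew-style R or Arabic-style AL, which includes presentation forms).
    static boolean isRtl(char c) {
        byte d = Character.getDirectionality(c);
        return d == Character.DIRECTIONALITY_RIGHT_TO_LEFT
            || d == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
    }

    // Simplified stand-in for the patch's logic (an assumption, not the patch):
    // reverse each maximal run of right-to-left characters, then apply NFKC
    // so that presentation forms such as U+FB50 become base letters (U+0671).
    static String toLogicalOrder(String visualLine) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < visualLine.length()) {
            if (isRtl(visualLine.charAt(i))) {
                int start = i;
                while (i < visualLine.length() && isRtl(visualLine.charAt(i))) {
                    i++;
                }
                // The run is stored left-to-right as displayed; flip it.
                out.append(new StringBuilder(visualLine.substring(start, i)).reverse());
            } else {
                out.append(visualLine.charAt(i));
                i++;
            }
        }
        return Normalizer.normalize(out, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Visual-order presentation forms, as in the "without the patch" output.
        String visual = "Hello \uFEAA\uFEE4\uFEA4\uFEE3 World.";
        // Expected logical-order result: "Hello \u0645\u062D\u0645\u062F World."
        System.out.println(toLogicalOrder(visual));
    }
}
```

Limitations of this sketch: spaces and punctuation are directionally neutral, so a multi-word Arabic phrase would have each word reversed but the word order left as displayed, and combining marks are not handled. NFKC also folds unrelated compatibility characters (for example Latin ligatures), which may or may not be wanted. A full solution would apply the Unicode Bidirectional Algorithm, for which the issue's patch relies on ICU4J.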

