pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-1424) Wrong glyph (Persian) is used in extracted text instead of the original glyph (Persian) in PDF file
Date Thu, 01 Nov 2012 17:47:13 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-1424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-1424:
---------------------------------------

    Attachment: PDFBOX1424-persian_test.html
    
> Wrong glyph (Persian) is used in extracted text instead of the original glyph (Persian) in PDF file
> ---------------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1424
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1424
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.1
>         Environment: Windows XP, Java 1.6.0
>            Reporter: Ali Majdzadeh Kohbanani
>         Attachments: PDFBOX1424-persian_test.html, persian_test.html, persian_test.pdf
>
>
> Hi
> I am very new to PDFBox and I am dealing with Persian PDF files. When I convert Persian
> PDF files using PDFBox-app, some Persian glyphs like م are displayed wrongly in the extracted
> text. For example, the word "هستم" in Persian is extracted as "هستن" and "من" in
> Persian is extracted as "هن". Also, the word "سلام" is extracted as "سالم". By the
> way, I have tested extracting text from a complete Persian PDF file with multiple pages; the
> result is disappointing. Actually, it is totally wrong. Please let me know if I should upload
> an example Persian PDF file.
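
The three examples in the report can be made concrete by comparing the expected and extracted words code point by code point. The following is only a minimal sketch, not part of PDFBox: the class name CodePointDiff and the method diff are hypothetical, it assumes Java 8 or later (newer than the Java 1.6 environment listed above), and it assumes the words are stored as the base Arabic letters (for example U+0645 for م) rather than as presentation forms.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; it is not a PDFBox class.
class CodePointDiff {

    /** Returns one entry per logical index at which the two strings differ. */
    public static List<String> diff(String expected, String extracted) {
        int[] a = expected.codePoints().toArray();
        int[] b = extracted.codePoints().toArray();
        List<String> out = new ArrayList<>();
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Positions past the end of the shorter string are reported as "(none)".
            String left = i < a.length ? String.format("U+%04X", a[i]) : "(none)";
            String right = i < b.length ? String.format("U+%04X", b[i]) : "(none)";
            if (!left.equals(right)) {
                out.add("index " + i + ": expected " + left + ", extracted " + right);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Word pairs from the report, written as escapes (assumed encoding):
        // heh U+0647, seen U+0633, teh U+062A, meem U+0645, noon U+0646,
        // lam U+0644, alef U+0627.
        String[][] pairs = {
            {"\u0647\u0633\u062A\u0645", "\u0647\u0633\u062A\u0646"},
            {"\u0645\u0646", "\u0647\u0646"},
            {"\u0633\u0644\u0627\u0645", "\u0633\u0627\u0644\u0645"},
        };
        for (String[] p : pairs) {
            System.out.println(diff(p[0], p[1]));
        }
    }
}
```

Under those assumptions, the first two pairs each differ in a single letter (the meem is replaced), while the third pair contains the same letters with lam and alef in swapped order. The sketch only describes the symptom; it says nothing about where in the extraction the difference arises.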

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
