pdfbox-dev mailing list archives

From "Adrian Romano (JIRA)" <j...@apache.org>
Subject [jira] Created: (PDFBOX-398) Russian extraction encoding failure
Date Mon, 05 Jan 2009 19:55:44 GMT
Russian extraction encoding failure

                 Key: PDFBOX-398
                 URL: https://issues.apache.org/jira/browse/PDFBOX-398
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.3, 0.8.0-incubator
         Environment: Windows XP 32-bit, CentOS 5.2 32-bit
            Reporter: Adrian Romano

I am extracting text from Russian documents with PDFTextStripper, and some of them do not extract correctly.
When I extract on Windows and specify UTF-8 encoding, the output is garbage.
When I extract on Linux using any encoding, the output is garbage.
The only way I can get usable output is to extract the PDF on Windows without specifying
an encoding. The output is then correct when viewed in UltraEdit, but not in Notepad.
I can view the output in Notepad only after I convert the file to UTF-8 with iconv.
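
The iconv step above amounts to decoding the file's bytes in the charset they were actually written in and re-encoding them as UTF-8. A minimal Java sketch of that conversion follows. The class and method names are hypothetical, and windows-1251 as the source charset is only an assumption: the report does not say which default encoding the Windows machine used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of PDFBox.
class TranscodeSketch {

    // Decode the bytes using the given source charset, then re-encode as UTF-8.
    static byte[] toUtf8(byte[] input, Charset source) {
        return new String(input, source).getBytes(StandardCharsets.UTF_8);
    }

    // Usage: java TranscodeSketch <input file> <output file>
    public static void main(String[] args) throws Exception {
        // ASSUMPTION: the extracted file is in windows-1251; adjust to the real charset.
        Charset source = Charset.forName("windows-1251");
        byte[] converted = toUtf8(Files.readAllBytes(Paths.get(args[0])), source);
        Files.write(Paths.get(args[1]), converted);
    }
}
```

If the source charset is guessed wrongly, the conversion still succeeds but produces different garbage, so the result needs to be checked by eye.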
It appears to me that the encoding is not being read correctly from the PDF, and that when the
text is written out as UTF-8 it is being double encoded or something similar. I can detect this
double encoding, rerun the extraction with no encoding specified, and then convert the result
to UTF-8 using iconv, and the output is OK.
However, this workaround does not work on Linux, because I cannot get the file to extract
correctly there using any encoding.
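
To make the double-encoding hypothesis concrete, here is a minimal Java sketch of how such output could arise and how it might be detected and undone. It assumes the UTF-8 bytes were misread as ISO-8859-1 before being encoded as UTF-8 a second time; the report does not confirm that this is what PDFBox does, and all names below are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class DoubleEncodingSketch {

    // Simulate the suspected bug: UTF-8 bytes are misread as ISO-8859-1 (ASSUMPTION).
    // Writing the returned string out as UTF-8 would give a double-encoded file.
    static String simulate(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Heuristic: every char fits in one byte, at least one is non-ASCII,
    // and those bytes form valid UTF-8 when taken together.
    static boolean looksDoubleEncoded(String text) {
        boolean sawNonAscii = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0xFF) return false;   // real Cyrillic and similar text is not a byte string
            if (c >= 0x80) sawNonAscii = true;
        }
        if (!sawNonAscii) return false;   // pure ASCII is unaffected by double encoding
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(text.getBytes(StandardCharsets.ISO_8859_1)));
            return true;
        } catch (CharacterCodingException e) {
            return false;                 // bytes are not valid UTF-8
        }
    }

    // Undo one layer of the mistaken decoding.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }
}
```

The check is only a heuristic: short strings of genuine Latin-1 text can occasionally pass it, so a repair is best applied only when `looksDoubleEncoded` holds for the extracted document as a whole.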

