pdfbox-dev mailing list archives

From "Ryan Minniear (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-561) Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.
Date Fri, 01 Jul 2011 20:49:28 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058789#comment-13058789 ]

Ryan Minniear commented on PDFBOX-561:
--------------------------------------

Are there any plans for when this bug will be fixed? I am working on a team which uses PDFBox
(indirectly through Solr/Tika) and would very much like to see this problem addressed.

> Text extraction with PDFTextStripper is system file.encoding dependent. Override does not work.
> -----------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-561
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-561
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.7.3, 0.8.0-incubator
>            Reporter: d ferbas
>         Attachments: blindtext_mit_bullets.pdf
>
>
> The text extraction depends on the JVM file.encoding setting. The "override" new PDFTextStripper("utf-8") (since version 0.8.0) has no effect.
> If a PDF file contains critical characters, the extracted string differs depending on the JVM system encoding.
> It has to be possible to set the encoding for the extraction to ensure the same results independent of the default system encoding.
> Sample file: see attachment "blindtext_mit_bullets.pdf"
> Bullets #3 to #8 differ using utf-8 vs cp1252
> Be aware that the file.encoding setting only works if passed while starting the JVM (-Dfile.encoding=utf-8). System.setProperty(..) does not work.
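
The encoding dependence described above can be illustrated with the JDK alone. The following is a minimal sketch, not PDFBox code: the class name EncodingDemo and the helper encode are hypothetical, and it assumes the windows-1252 (cp1252) charset is available in the running JDK. It shows that the same bullet character produces different bytes under utf-8 and cp1252, and that naming the charset explicitly gives the same result regardless of file.encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or reproduce any PDFBox API.
public class EncodingDemo {

    // Encode text with an explicitly named charset, so the result
    // does not depend on the JVM default (file.encoding).
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    public static void main(String[] args) {
        String bullet = "\u2022"; // U+2022 BULLET

        // Assumption: windows-1252 is supported by this JDK.
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("utf-8 bytes:   " + encode(bullet, StandardCharsets.UTF_8).length);
        System.out.println("cp1252 bytes:  " + encode(bullet, cp1252).length);

        // getBytes() without an argument uses the JVM default charset,
        // so this value can change with -Dfile.encoding=...
        System.out.println("default bytes: " + bullet.getBytes().length);
    }
}
```

Running this with different -Dfile.encoding values should change only the last line of output, which is the kind of dependence the report asks to be able to avoid.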

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
