pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-2463) ExtractTextByArea mangling second half of this string - transposed, skipped, etc
Date Mon, 03 Nov 2014 18:45:35 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194880#comment-14194880 ]

Andreas Lehmkühler commented on PDFBOX-2463:
--------------------------------------------

Which area did you use? Extracting the whole text works like a charm.

> ExtractTextByArea mangling second half of this string - transposed, skipped, etc
> --------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2463
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2463
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Joel Hirsh
>         Attachments: mangled_text .pdf
>
>
> The text in the attached PDF snippet is completely mangled by ExtractTextByArea. I have a large PDF file where this happens on every line.
> Visually, and in Acrobat, the text reads:
> 12 Jun EP COPY WORKS LIMITED 503646200256 5637 3.70 11,252.49 OD
> However, ExtractTextByArea produces:
> 12 Jun EP COPY WORKS LIMITED 503646200256 35 .6 70
> 11,
> 3 257 2.49
> OD
> So the first half of the string is OK, but starting at '5637' characters are skipped and other characters are inserted, leaving the rest completely mangled.
> FWIW, I dumped the COSString objects in PDFStreamEngine and the strings all show correctly, with nothing unusual.
> Test file to be attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
