pdfbox-dev mailing list archives

From Andreas Lehmkühler (JIRA) <j...@apache.org>
Subject [jira] Resolved: (PDFBOX-358) Vertical text extraction splitting text
Date Thu, 08 Jan 2009 19:56:59 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-358.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator

> Vertical text extraction splitting text
> ---------------------------------------
>
>                 Key: PDFBOX-358
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-358
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Jukka Zitting
>             Fix For: 0.8.0-incubator
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832
> Vertical text gets split during extraction using PDFTextStripper.
> "Specification" gives:
> Spécif
> ic
> ations
> This is made worse when sorted by position, as it gets mixed up with the
> horizontal text:
> ic
> ations
> [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> Spécif [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> I'm afraid I can't provide the PDF in question due to confidentiality
> requirements. It's a PDF obtained from the conversion to PDF of a Windows
> Word document. According to the forums I'm not the only one with this
> problem.
> [Comment on SourceForge]
> Date: 2008-06-02 09:11
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> To clarify, the text in question is rotated by 90° counter-clockwise.
> [Comment on SourceForge]
> Date: 2008-06-02 10:30
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> I have attached an admittedly very complicated PDF document which (as far
> as I can tell) features 90° and 135° rotated text in a 90° rotated page.
> Position-ordered text extraction gives horrible results. 
> Normal text extraction is also very messy, although in this second case
> the results are almost understandable. 
> This is not the document I need to treat, but I think that if text can be
> correctly extracted from that PDF, it should work for almost every other
> existing PDF.
> File Added: Flyer2.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851
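
The interleaving described in the report can be illustrated with a toy model. The sketch below is not PDFBox code and does not use the PDFBox API: the Fragment class, the coordinates, and the "line N" strings are all hypothetical, and it assumes that position sorting simply orders fragments top-to-bottom and then left-to-right without looking at rotation. Under that assumption, a word rotated 90° counter-clockwise (which reads bottom to top) comes out reversed and mixed in with the horizontal lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Toy model only: this is NOT PDFBox code and does not call the PDFBox API.
class PositionSortSketch {

    // Hypothetical text fragment: a run of text and the position where it starts.
    // Assumption of this sketch: y grows downward (0 = top of the page).
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Naive position sort: top-to-bottom, then left-to-right, ignoring rotation.
    static List<String> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        List<String> texts = new ArrayList<>();
        for (Fragment f : sorted) {
            texts.add(f.text);
        }
        return texts;
    }

    // Made-up page: one vertical word in the left margin, already split into
    // three pieces, next to four horizontal lines of body text.
    static List<Fragment> sample() {
        return Arrays.asList(
                // Rotated 90 degrees counter-clockwise, so it reads bottom to top.
                new Fragment("Specif", 10, 300),
                new Fragment("ic", 10, 200),
                new Fragment("ations", 10, 150),
                // Ordinary horizontal text.
                new Fragment("line 1", 50, 100),
                new Fragment("line 2", 50, 180),
                new Fragment("line 3", 50, 300),
                new Fragment("line 4", 50, 350));
    }

    public static void main(String[] args) {
        // Prints [line 1, ations, line 2, ic, Specif, line 3, line 4]
        System.out.println(sortByPosition(sample()));
    }
}
```

In this model the pieces of the vertical word land between unrelated horizontal lines, and the first piece ends up on the same output row as a horizontal line, which resembles the "Spécif [CLASSIFIED INFO]" row in the report. Grouping fragments by text direction before sorting would avoid this in the toy model; whether the actual fix works that way is not stated in this message.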

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

