pdfbox-dev mailing list archives

From "Jukka Zitting (JIRA)" <j...@apache.org>
Subject [jira] Created: (PDFBOX-358) Vertical text extraction splitting text
Date Mon, 04 Aug 2008 18:20:44 GMT
Vertical text extraction splitting text
---------------------------------------

                 Key: PDFBOX-358
                 URL: https://issues.apache.org/jira/browse/PDFBOX-358
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
            Reporter: Jukka Zitting


[Issue from SourceForge]
http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832

Vertical text gets split during extraction using PDFTextStripper.

"Specification" gives:
Spécif
ic
ations

This is made worse when sorted by position, as it gets mixed up with the
horizontal text:
ic
ations
[CLASSIFIED INFO]
[CLASSIFIED INFO]
Spécif [CLASSIFIED INFO]
[CLASSIFIED INFO]

I'm afraid I can't provide the PDF in question due to confidentiality
requirements. The PDF was produced by converting a Microsoft Word
document. According to the forums, I'm not the only one with this
problem.
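
To illustrate the ordering problem described above, here is a minimal
sketch of one possible approach: group text fragments by their rotation,
then sort each group in its own reading frame instead of sorting
everything by raw page position. This is not how PDFTextStripper works
and none of the names below are PDFBox API. The class RotationAwareOrder,
the Fragment type, the lineTolerance parameter and the sample coordinates
are all hypothetical, and the sketch assumes each fragment's rotation is
already known.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; none of these names are PDFBox API.
class RotationAwareOrder {

    // One extracted text run: position in PDF user space (y grows upward)
    // and the counter-clockwise rotation of its baseline in degrees.
    static final class Fragment {
        final String text;
        final double x, y;
        final int rotation;
        Fragment(String text, double x, double y, int rotation) {
            this.text = text; this.x = x; this.y = y; this.rotation = rotation;
        }
    }

    // A fragment re-expressed in its own reading frame.
    private static final class Aligned {
        final int rotation;  // normalized to 0..359
        final long line;     // quantized height in the rotated frame
        final double along;  // position along the baseline
        final String text;
        Aligned(int rotation, long line, double along, String text) {
            this.rotation = rotation; this.line = line; this.along = along; this.text = text;
        }
    }

    static List<String> extract(List<Fragment> fragments, double lineTolerance) {
        List<Aligned> aligned = new ArrayList<>();
        for (Fragment f : fragments) {
            int rot = Math.floorMod(f.rotation, 360);
            double t = Math.toRadians(rot);
            // Rotate the position by -rot so the baseline runs left to right.
            double along = f.x * Math.cos(t) + f.y * Math.sin(t);
            double height = -f.x * Math.sin(t) + f.y * Math.cos(t);
            // Quantize the height so fragments on one line share a key.
            aligned.add(new Aligned(rot, Math.round(height / lineTolerance), along, f.text));
        }
        aligned.sort((a, b) -> {
            if (a.rotation != b.rotation) return Integer.compare(a.rotation, b.rotation);
            if (a.line != b.line) return Long.compare(b.line, a.line); // higher lines first
            return Double.compare(a.along, b.along);
        });

        // Join consecutive fragments that share a rotation and a line.
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Aligned previous = null;
        for (Aligned a : aligned) {
            boolean sameLine = previous != null
                    && previous.rotation == a.rotation && previous.line == a.line;
            if (!sameLine && previous != null) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(a.text);
            previous = a;
        }
        if (previous != null) lines.add(current.toString());
        return lines;
    }

    public static void main(String[] args) {
        // Made-up coordinates: a vertical word split into three runs,
        // interleaved with two horizontal lines.
        List<Fragment> page = Arrays.asList(
                new Fragment("ic", 50, 160, 90),
                new Fragment("First horizontal line", 100, 700, 0),
                new Fragment("ations", 50, 180, 90),
                new Fragment("Specif", 50, 100, 90),
                new Fragment("Second horizontal line", 100, 650, 0));
        for (String line : extract(page, 2.0)) {
            System.out.println(line);
        }
    }
}
```

With the made-up input in main, the three vertical runs share one line
key in the rotated frame, so they are joined in reading order and kept
apart from the horizontal lines. Quantizing the height is a crude way to
detect lines; a real implementation would likely need font-size-aware
tolerances and word spacing.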

[Comment on SourceForge]
Date: 2008-06-02 09:11
Sender: totoll
Logged In: YES 
user_id=2096423
Originator: YES

To clarify, the text in question is rotated by 90° counter-clockwise.

[Comment on SourceForge]
Date: 2008-06-02 10:30
Sender: totoll
Logged In: YES 
user_id=2096423
Originator: YES

I have attached an admittedly very complicated PDF document which (as far
as I can tell) features 90° and 135° rotated text in a 90° rotated page.


Position-ordered text extraction gives horrible results. 

Normal text extraction is also very messy, although in this second case
the results are almost understandable. 

This is not the document I need to process, but I think that if text can
be correctly extracted from that PDF, extraction should work for almost
any other PDF.
File Added: Flyer2.pdf
http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

