pdfbox-dev mailing list archives

From "Brian Carrier (JIRA)" <j...@apache.org>
Subject [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction
Date Wed, 18 Feb 2009 21:36:01 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-430.

    Resolution: Fixed

Fixed with a patch by Ken Glidden that merges a single-diacritic text chunk into the previous
text chunk if the two overlap.  Note that this does not solve cases where the diacritic arrives
well after the text chunk it overlays, but we have not observed PDF files like that.
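
For readers who want a concrete picture of the overlap merge described above, here is a
minimal sketch. It is not the committed patch: the Chunk class, its fields, and the
mergeIfOverlapping method are hypothetical stand-ins, not PDFBox's TextPosition API. It
assumes the diacritic already arrives as a Unicode combining mark and that each chunk
carries one advance width per character.

```java
import java.text.Normalizer;

// Illustrative sketch only; names here are hypothetical, not PDFBox classes.
final class DiacriticMergeSketch {

    /** Hypothetical positioned chunk: text, left edge, one advance width per char. */
    static final class Chunk {
        final String text;
        final float x;
        final float[] widths; // assumed: widths.length == text.length()

        Chunk(String text, float x, float[] widths) {
            this.text = text;
            this.x = x;
            this.widths = widths;
        }

        /** Right edge of the chunk: left edge plus the sum of the advance widths. */
        float endX() {
            float end = x;
            for (float w : widths) {
                end += w;
            }
            return end;
        }
    }

    /** True if the chunk is exactly one Unicode non-spacing (combining) mark. */
    static boolean isSingleDiacritic(Chunk c) {
        return c.text.length() == 1
                && Character.getType(c.text.charAt(0)) == Character.NON_SPACING_MARK;
    }

    /**
     * Returns prev's text with the mark attached to the character it overlaps
     * most, or null when the mark is not a lone diacritic or does not overlap.
     */
    static String mergeIfOverlapping(Chunk prev, Chunk mark) {
        if (prev.text.isEmpty() || !isSingleDiacritic(mark)) {
            return null;
        }
        float markLeft = mark.x;
        float markRight = mark.endX();
        if (markRight <= prev.x || markLeft >= prev.endX()) {
            return null; // no horizontal overlap with the previous chunk
        }
        // Pick the character whose horizontal span overlaps the mark the most.
        int best = 0;
        float bestOverlap = Float.NEGATIVE_INFINITY;
        float left = prev.x;
        for (int i = 0; i < prev.text.length(); i++) {
            float right = left + prev.widths[i];
            float overlap = Math.min(right, markRight) - Math.max(left, markLeft);
            if (overlap > bestOverlap) {
                best = i;
                bestOverlap = overlap;
            }
            left = right;
        }
        // A combining mark follows its base character; NFC then composes the pair.
        String combined = prev.text.substring(0, best + 1)
                + mark.text
                + prev.text.substring(best + 1);
        return Normalizer.normalize(combined, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        Chunk word = new Chunk("resume", 0f, new float[] {5f, 5f, 5f, 5f, 5f, 5f});
        Chunk accent = new Chunk("\u0301", 26f, new float[] {3f});
        System.out.println(mergeIfOverlapping(word, accent));
    }
}
```

As in the fix itself, the sketch only looks at the immediately preceding chunk, so a
diacritic that shows up several chunks later would still be left unmerged.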

Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
Sending        trunk/test/input/Acrobat9.pdf.txt
Transmitting file data ....
Committed revision 745665.

> Incorrect diacritic placement in text extraction
> ------------------------------------------------
>                 Key: PDFBOX-430
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
>             Project: PDFBox
>          Issue Type: Bug
>            Reporter: Brian Carrier
> Some PDF files store diacritics (accents over characters) as separate text elements.
> The PDF files essentially have a chunk of text and then back up and place the diacritic over
> one of the characters in the chunk of text. With text extraction, the current design does
> not allow the diacritic to be placed over a character in the chunk and instead it is placed
> after the chunk.
> The debug-diac2.pdf file in PDFBOX-429 shows this problem. 

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
