pdfbox-dev mailing list archives

From Jeremias Maerki <...@jeremias-maerki.ch>
Subject Re: [jira] Resolved: (PDFBOX-430) Incorrect diacritic placement in text extraction
Date Sat, 28 Feb 2009 17:26:07 GMT

You state here that you've applied a patch by one Ken Glidden. I cannot
find any post or submission from a person with that name on the PDFBox
mailing lists. So I'm concerned about the legal trail here. Can you
explain that, please? Thank you.

On 18.02.2009 22:36:01 Brian Carrier (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/PDFBOX-430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> Brian Carrier resolved PDFBOX-430.
> ----------------------------------
>     Resolution: Fixed
> Fixed with patch by Ken Glidden that merges a single diacritic text chunk into the previous
> text chunk if they overlap.  Note that this will not solve problems where the diacritic comes
> much after the text chunk it overlays, but we have not observed PDF files like that.
> Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
> Sending        trunk/src/main/java/org/apache/pdfbox/util/TextPosition.java
> Sending        trunk/test/input/Acrobat9.pdf-sorted.txt
> Sending        trunk/test/input/Acrobat9.pdf.txt
> Transmitting file data ....
> Committed revision 745665.
> > Incorrect diacritic placement in text extraction
> > ------------------------------------------------
> >
> >                 Key: PDFBOX-430
> >                 URL: https://issues.apache.org/jira/browse/PDFBOX-430
> >             Project: PDFBox
> >          Issue Type: Bug
> >            Reporter: Brian Carrier
> >
> > Some PDF files store diacritics (accents over characters) as separate text elements.
> > The PDF files essentially have a chunk of text and then backup and place the diacritic over
> > one of the characters in the chunk of text. With text extraction, the current design does
> > not allow the diacritic to be placed over a character in the chunk and instead it is placed
> > after the chunk.
> > The debug-diac2.pdf file in PDFBOX-429 shows this problem. 
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
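
For readers of the archive, the merge described in the quoted resolution note can be pictured roughly as follows. This is only a sketch under stated assumptions and is not the patch committed in revision 745665: the `Chunk` class, the method names, the equal-width-character estimate, and the assumption that the diacritic arrives as a single Unicode combining mark are all hypothetical and do not reflect the actual PDFTextStripper or TextPosition code.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

public class DiacriticMergeSketch {

    /** Hypothetical stand-in for a positioned text chunk; not PDFBox's TextPosition. */
    public static final class Chunk {
        public final String text;
        public final float x;      // left edge of the chunk
        public final float width;  // total width; characters are assumed equally wide

        public Chunk(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float right() {
            return x + width;
        }
    }

    /** Assumption: a diacritic chunk is exactly one Unicode combining mark. */
    public static boolean isSingleDiacritic(Chunk c) {
        if (c.text.isEmpty() || c.text.codePointCount(0, c.text.length()) != 1) {
            return false;
        }
        int type = Character.getType(c.text.codePointAt(0));
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    /** True when the horizontal extents of the two chunks intersect. */
    public static boolean overlaps(Chunk a, Chunk b) {
        return b.x < a.right() && b.right() > a.x;
    }

    /** Merges each single-diacritic chunk into the previous chunk if they overlap. */
    public static List<Chunk> merge(List<Chunk> chunks) {
        List<Chunk> out = new ArrayList<>();
        for (Chunk c : chunks) {
            Chunk prev = out.isEmpty() ? null : out.get(out.size() - 1);
            if (prev != null && !prev.text.isEmpty()
                    && isSingleDiacritic(c) && overlaps(prev, c)) {
                out.set(out.size() - 1, insertDiacritic(prev, c));
            } else {
                // Anything else, including a diacritic that does not overlap
                // the previous chunk, is kept as a chunk of its own.
                out.add(c);
            }
        }
        return out;
    }

    /** Places the mark after the character whose estimated cell contains the mark's centre. */
    static Chunk insertDiacritic(Chunk base, Chunk mark) {
        int count = base.text.codePointCount(0, base.text.length());
        float charWidth = base.width / count;
        float centre = mark.x + mark.width / 2;
        int index = (int) ((centre - base.x) / charWidth);
        index = Math.max(0, Math.min(count - 1, index));
        int insertAt = base.text.offsetByCodePoints(0, index + 1);
        String merged = base.text.substring(0, insertAt) + mark.text
                + base.text.substring(insertAt);
        // NFC composes base letter + combining mark where a precomposed form exists.
        return new Chunk(Normalizer.normalize(merged, Normalizer.Form.NFC),
                base.x, base.width);
    }
}
```

In this sketch, a chunk "cafe" at x=0 with width 40 followed by a U+0301 chunk at x=31 with width 8 would be merged into a single chunk, while the same mark positioned well to the right of the text would stay separate. Like the fix described above, it only looks at the immediately preceding chunk, so a diacritic that arrives much later than the text it overlays is not handled.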

Jeremias Maerki
