pdfbox-dev mailing list archives

From "Emilian Bold (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-4313) PDFTextStripper groups unrelated chunks into words
Date Fri, 14 Sep 2018 15:26:00 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-4313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Emilian Bold updated PDFBOX-4313:
---------------------------------
    Attachment: 1536938716546.pdf

> PDFTextStripper groups unrelated chunks into words
> --------------------------------------------------
>
>                 Key: PDFBOX-4313
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4313
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.11
>            Reporter: Emilian Bold
>            Priority: Major
>         Attachments: 1536938716546.pdf, PDFBOX-4313.pdf, crop-fisa-sintetica.png
>
>
> I have the text "10" and "11", and they get merged into the single word "1110".
> Coordinates are:
> 1 575.36 x 227.4 w 4.447998 h 5.736
> 1 579.752 x 227.4 w 4.447998 h 5.736
> 1 526.2 x 227.4 w 4.447998 h 5.736
> 0 530.59204 x 227.4 w 4.447998 h 5.736
> The bug is in this PDFTextStripper chunk:
> {{
>                     // test if our TextPosition starts after a new word would be expected to start
>                     if (expectedStartOfNextWordX != EXPECTED_START_OF_NEXT_WORD_X_RESET_VALUE
>                             && expectedStartOfNextWordX < positionX &&
>                             // only bother adding a space if the last character was not a space
>                             lastPosition.getTextPosition().getUnicode() != null
>                             && !lastPosition.getTextPosition().getUnicode().endsWith(" "))
>                     {
>                         line.add(LineItem.getWordSeparator());
>                     }
> }}
> This check seems to add a word separator only if the next char starts "after" the current word. It never expects that the next char might start "before" the current word.
> I guess this could also be framed as an RTL problem, but the PDF is a plain PDF; it just seems that Oracle Reports generates these chunks in reverse order.
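
The forward-only behaviour described above can be reproduced outside PDFBox. What follows is a minimal sketch, not PDFBox code: the Glyph class, the join method and the 1.0 gap tolerance are hypothetical names and values chosen for illustration. It compares each glyph only with the previous one instead of tracking PDFBox's real expectedStartOfNextWordX state, and it assumes the coordinate lines in the report read as glyph, x, y, width, height.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; none of these names exist in PDFBox. */
final class WordGapSketch {

    /** Hypothetical stand-in for a TextPosition: glyph text, x origin, width. */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /** The four glyphs from the report, in the order they were listed. */
    static List<Glyph> reportedGlyphs() {
        return Arrays.asList(
                new Glyph("1", 575.36f, 4.447998f),
                new Glyph("1", 579.752f, 4.447998f),
                new Glyph("1", 526.2f, 4.447998f),
                new Glyph("0", 530.59204f, 4.447998f));
    }

    /**
     * Joins glyphs in stream order. A separator is always added when the next
     * glyph starts to the right of the expected position. When checkBackward
     * is true, one is also added when the next glyph ends to the left of the
     * previous glyph.
     */
    static String join(List<Glyph> glyphs, float gap, boolean checkBackward) {
        StringBuilder line = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                boolean startsAfter = g.x > prev.x + prev.width + gap;
                boolean endsBefore = checkBackward && g.x + g.width < prev.x - gap;
                if (startsAfter || endsBefore) {
                    line.append(' ');
                }
            }
            line.append(g.text);
            prev = g;
        }
        return line.toString();
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = reportedGlyphs();
        System.out.println(join(glyphs, 1.0f, false)); // prints "1110"
        System.out.println(join(glyphs, 1.0f, true));  // prints "11 10"
    }
}
```

With the forward-only check, the jump from x = 579.752 back to x = 526.2 is never treated as a word boundary, which matches the merged "1110" in the report. Whether a symmetric check like this is the right fix for PDFTextStripper is a separate question, since it could interact with genuine RTL text handling.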



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


