pdfbox-dev mailing list archives

From "Michael Klink (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (PDFBOX-4236) PDFTextStripper diacritic merge sometimes chooses wrong base glyph
Date Mon, 04 Jun 2018 15:59:00 GMT

     [ https://issues.apache.org/jira/browse/PDFBOX-4236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Klink updated PDFBOX-4236:
    Priority: Minor  (was: Major)

> PDFTextStripper diacritic merge sometimes chooses wrong base glyph
> ------------------------------------------------------------------
>                 Key: PDFBOX-4236
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4236
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 3.0.0 PDFBox
>            Reporter: Michael Klink
>            Priority: Minor
>         Attachments: SA-U-NA.png, pattern3.pdf
> In the course of answering [this Stack Overflow question|https://stackoverflow.com/q/50664162/1729265]
I saw that text extraction from the example file [^pattern3.pdf] exposes an error in the
diacritic merging code: the wrong base glyph is chosen.
> From the bottom of [my answer|https://stackoverflow.com/a/50679508/1729265] there:
> {quote}By the way, your test file exposes an error in the PDFBox determination of the
base glyph to merge a diacritic with: The "स[1434]ु[1441]न[1418]" is meant to be rendered
as "सुन", i.e. the vowel sign u "ु" is combined with the letter sa "स", but PDFBox
combines it with the subsequent letter na "न" as "सनु".
> The cause is that PDFBox determines the letter to combine the diacritic with by the
diacritic's origin, which here indeed lies in the range of the following letter na "न". But as
the vowel sign glyph is rendered before its origin (it is drawn in an area with negative x
coordinates), PDFBox determines the wrong association:
> !SA-U-NA.png! 
> {quote}
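
The mismatch described in the quote can be illustrated with a small, self-contained sketch. It is not the PDFBox implementation: the class, the Glyph type, both lookup methods, and all coordinates are hypothetical and chosen only to mirror the situation in the report, where a zero-advance vowel sign has its origin at the start of na but is drawn to the left of that origin, under sa. Choosing the base by the center of the drawn extent is just one possible alternative shown for contrast, not the fix adopted for this issue.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not PDFBox code: contrasts two ways of picking a diacritic's base glyph.
final class DiacriticMergeSketch {

    /** A placed glyph: origin, advance width, and horizontal ink extent relative to the origin. */
    static final class Glyph {
        final String text;
        final double originX;
        final double advance;
        final double inkMinX; // may be negative: ink drawn left of the origin
        final double inkMaxX;

        Glyph(String text, double originX, double advance, double inkMinX, double inkMaxX) {
            this.text = text;
            this.originX = originX;
            this.advance = advance;
            this.inkMinX = inkMinX;
            this.inkMaxX = inkMaxX;
        }

        /** True if x lies within [originX, originX + advance). */
        boolean spans(double x) {
            return x >= originX && x < originX + advance;
        }

        /** Horizontal center of the area the glyph is actually drawn in. */
        double inkCenter() {
            return originX + (inkMinX + inkMaxX) / 2.0;
        }
    }

    /** Index of the first base whose advance range contains x, or -1 if none does. */
    private static int indexOfBaseSpanning(List<Glyph> bases, double x) {
        for (int i = 0; i < bases.size(); i++) {
            if (bases.get(i).spans(x)) {
                return i;
            }
        }
        return -1;
    }

    /** Association by the diacritic's origin, the behavior the report describes. */
    static int baseByOrigin(List<Glyph> bases, Glyph mark) {
        return indexOfBaseSpanning(bases, mark.originX);
    }

    /** Association by the center of the diacritic's drawn extent (assumed alternative). */
    static int baseByInkCenter(List<Glyph> bases, Glyph mark) {
        return indexOfBaseSpanning(bases, mark.inkCenter());
    }

    public static void main(String[] args) {
        // Made-up coordinates: sa occupies [0, 10), na occupies [10, 18).
        Glyph sa = new Glyph("\u0938", 0.0, 10.0, 0.5, 9.5);
        Glyph na = new Glyph("\u0928", 10.0, 8.0, 0.5, 7.5);
        // Vowel sign u: origin at 10 (start of na), but ink lies 6 to 1 units left of it.
        Glyph u = new Glyph("\u0941", 10.0, 0.0, -6.0, -1.0);

        List<Glyph> bases = Arrays.asList(sa, na);
        System.out.println("by origin index: " + baseByOrigin(bases, u));
        System.out.println("by ink center index: " + baseByInkCenter(bases, u));
    }
}
```

With these made-up numbers, the origin-based lookup selects index 1 (na), while the lookup by drawn extent selects index 0 (sa), which matches the intended rendering "सुन".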

This message was sent by Atlassian JIRA
