pdfbox-dev mailing list archives

From "Tilman Hausherr (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0
Date Wed, 02 Dec 2015 18:00:15 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036261#comment-15036261 ]

Tilman Hausherr commented on PDFBOX-3062:
-----------------------------------------

Regarding reliability, see the test files and the files attached to this issue and to related
issues. Their extraction improved thanks to the last change.

{quote}
The CapHeight also isn't a good proxy for a glyph's visual bounds. Many glyphs will be higher
or lower than that.
{quote}
I doubt that glyphs will be higher. And I'm only using it if it is smaller than BBox height
/ 2.
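
A minimal sketch of the rule described above, in plain Java with no PDFBox dependency. The class and method names are hypothetical, and falling back to half the BBox height when the CapHeight is missing or too large is an assumption, not something stated in this comment:

```java
// Hypothetical helper; not taken from the PDFBox sources.
final class GlyphHeightSketch {

    /**
     * Picks a glyph height in glyph-space units.
     * Assumption: the fallback is half the font bounding box height.
     * The CapHeight is used only when it is positive and smaller than that.
     */
    static float chooseHeight(float bboxHeight, float capHeight) {
        float half = bboxHeight / 2f;
        if (capHeight > 0f && capHeight < half) {
            return capHeight;
        }
        return half;
    }

    public static void main(String[] args) {
        // CapHeight not smaller than BBox height / 2: the fallback is used.
        System.out.println(chooseHeight(1000f, 700f));
        // CapHeight smaller than BBox height / 2: the CapHeight is used.
        System.out.println(chooseHeight(2000f, 700f));
    }
}
```

With these rules a font that reports no CapHeight (0) still gets a usable height from its BBox.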

"will make software slower" is not FUD, it is logical: there are extra calculations to make.
We need to get the paths, which we don't need currently, and we'll need to calculate the bounds.
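
To illustrate the extra work, here is a rough sketch of a path-based height calculation using only java.awt.geom. The outline is a made-up triangle standing in for a real glyph path, the 1000 units per em scale is an assumption, and none of the names come from PDFBox:

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

// Hypothetical sketch; a real implementation would obtain the outline from the font.
final class GlyphBoundsSketch {

    /** Made-up outline in glyph space (assumed 1000 units per em). */
    static Path2D.Float sampleOutline() {
        Path2D.Float p = new Path2D.Float();
        p.moveTo(0f, 0f);
        p.lineTo(350f, 700f);
        p.lineTo(700f, 0f);
        p.closePath();
        return p;
    }

    /** Walks the outline to get its bounds, then scales the height to the font size. */
    static float visualHeight(Path2D outline, float fontSize) {
        Rectangle2D bounds = outline.getBounds2D();
        return (float) (bounds.getHeight() / 1000.0 * fontSize);
    }

    public static void main(String[] args) {
        System.out.println(visualHeight(sampleOutline(), 10f));
    }
}
```

This would have to run for every glyph (or be cached per glyph), whereas the current approach only reads values from the font descriptor and the font BBox.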

> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png,
>                      PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf,
>                      PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

