pdfbox-dev mailing list archives

From "John Hewson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0
Date Wed, 02 Dec 2015 17:41:11 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036215#comment-15036215 ]

John Hewson edited comment on PDFBOX-3062 at 12/2/15 5:40 PM:
--------------------------------------------------------------

{quote}
How is it not reliable?
{quote}

Why would it be? There's no reason it should be more accurate than the bbox - neither is
used during rendering.

{quote}
That's why CapHeight is used when the BBox isn't helpful.
{quote}

The CapHeight also isn't a good proxy for a glyph's visual bounds. Many glyphs will be higher
or lower than that.

{quote}
calculate a new BBox from actual glyphs: will make software slower
{quote}

Sounds like FUD to me.
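
To make the trade-off concrete, here is a minimal sketch of deriving a bounding box from actual glyph outlines and caching it per font, so the cost is paid once rather than on every text position. It uses only java.awt.geom; the class name, the fontKey string, the box() helper and the sample outline coordinates are all hypothetical and are not PDFBox API.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GlyphBoundsSketch {
    // Hypothetical cache: one union rectangle per font identifier.
    private static final Map<String, Rectangle2D> CACHE = new HashMap<>();

    // Union of the outline bounds of all glyphs, computed once per fontKey.
    static Rectangle2D unionBounds(String fontKey, List<Path2D> outlines) {
        return CACHE.computeIfAbsent(fontKey, k -> {
            Rectangle2D union = null;
            for (Path2D outline : outlines) {
                Rectangle2D b = outline.getBounds2D();
                if (b.isEmpty()) {
                    continue; // skip glyphs without a visible area, e.g. space
                }
                union = (union == null) ? b : union.createUnion(b);
            }
            return (union == null) ? new Rectangle2D.Double() : union;
        });
    }

    // Helper that builds a rectangular stand-in for a glyph outline.
    static Path2D box(double x0, double y0, double x1, double y1) {
        Path2D.Double p = new Path2D.Double();
        p.moveTo(x0, y0);
        p.lineTo(x1, y0);
        p.lineTo(x1, y1);
        p.lineTo(x0, y1);
        p.closePath();
        return p;
    }

    public static void main(String[] args) {
        double capHeight = 700; // assumed CapHeight in glyph space units
        List<Path2D> glyphs = Arrays.asList(
                box(0, 0, 600, 700),    // a capital letter reaching CapHeight
                box(0, -200, 500, 500), // a glyph with a descender
                box(0, 0, 600, 900));   // an accented capital above CapHeight
        Rectangle2D r = unionBounds("example-font", glyphs);
        System.out.println("CapHeight: " + capHeight);
        System.out.println("Outline bounds: minY=" + r.getMinY() + " maxY=" + r.getMaxY());
    }
}
```

With these made-up outlines the union spans from below the baseline to above the assumed CapHeight, which is the sense in which CapHeight under-reports the visual bounds. Whether a per-font cache like this is fast enough in practice would need measuring on real documents.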

As I see it there are two questions:

1) what is the correct thing to do?
2) what should we do for 2.0?


> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf,
>                      PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png,
>                      PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf,
>                      PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf,
>                      PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf,
>                      garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}
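
The height fields in the two dumps above can be compared mechanically. The following is a rough sketch that pulls the height value out of one quoted line per version and prints the ratio; the regular expression is an assumption based only on the line format shown in this issue, and the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeightRatioSketch {
    // Assumed format: "... height=<decimal> ..." as in the quoted debug lines.
    private static final Pattern HEIGHT = Pattern.compile("height=([0-9.]+)");

    // Extracts the height field from one debug line.
    static double height(String line) {
        Matcher m = HEIGHT.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no height field: " + line);
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        // First line ("W") of the 1.8 and 2.0 dumps, copied from the issue.
        String v18 = "String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W";
        String v20 = "String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W";
        double ratio = height(v18) / height(v20);
        System.out.printf("1.8 / 2.0 height ratio: %.3f%n", ratio);
    }
}
```

For the lines quoted here, the 1.8 height is a little more than twice the 2.0 height, for both the large initial and the smaller capitals.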



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

