pdfbox-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "John Hewson (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0
Date Fri, 11 Dec 2015 20:00:47 GMT

    [ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053474#comment-15053474
] 

John Hewson edited comment on PDFBOX-3062 at 12/11/15 8:00 PM:
---------------------------------------------------------------

To wrap this one up, revisiting my questions above:

{quote}
1) what is the correct thing to do?
2) what should we do for 2.0?
{quote}

The answers would seem to be:

1) visual bounds but only if its not really slow (and we don't know, because we haven't tried)
2) keep the 1.8 heuristic to be pragmatic

So, we should revisit this later but for 2.0 there's clearly strong support for keeping the
existing heuristic. Issue resolved.


was (Author: jahewson):
To wrap this one up, revisiting my questions above:

{quote}
1) what is the correct thing to do?
2) what should we do for 2.0?
{quote}

The answers would seem to be:

1) visual bounds but only if its not really slow (and we don't know, because we haven't tried)
2) keep the 1.8 heuristic to be pragmatic

> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png,
PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf,
PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Mime
View raw message