Date: Wed, 2 Dec 2015 17:41:11 +0000 (UTC)
From: "John Hewson (JIRA)"
To: dev@pdfbox.apache.org
Subject: [jira] [Comment Edited] (PDFBOX-3062) Text extraction and height different in 2.0

    [ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036215#comment-15036215 ]

John Hewson edited comment on PDFBOX-3062 at 12/2/15 5:40 PM:
--------------------------------------------------------------

{quote}
How is it not reliable?
{quote}

Why would it be? There's no reason it should be more accurate than the bbox - neither are used during rendering.

{quote}
That's why CapHeight is used when the BBox isn't helpful.
{quote}

The CapHeight also isn't a good proxy for a glyph's visual bounds.
Many glyphs will be higher or lower than that.

{quote}
calculate a new BBox from actual glyphs: will make software slower
{quote}

Sounds like FUD to me.

As I see it there are two questions:

1) what is the correct thing to do?
2) what should we do for 2.0?


was (Author: jahewson):
{quote}
How is it not reliable?
{quote}

Why would it be? There's no reason it should be more accurate than the bbox - neither are used during rendering.

{quote}
That's why CapHeight is used when the BBox isn't helpful.
{quote}

The CapHeight also isn't a good proxy for a glyph's visual bounds. Many glyphs will be higher or lower than that.

{quote}
calculate a new BBox from actual glyphs: will make software slower
{quote}

Sounds like FUD to me.

> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf, PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)