Date: Tue, 1 Dec 2015 20:56:11 +0000 (UTC)
From: "Tilman Hausherr (JIRA)"
To: dev@pdfbox.apache.org
Reply-To: dev@pdfbox.apache.org
Subject: [jira] [Commented] (PDFBOX-3062) Text extraction and height different in 2.0


    [ https://issues.apache.org/jira/browse/PDFBOX-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034552#comment-15034552 ]

Tilman Hausherr commented on PDFBOX-3062:
-----------------------------------------

{quote}
BBox + CapHeight isn't reliable either.
{quote}
How is it not reliable? Do you know of any files that get fewer tokens extracted?

{quote}
The BBox is fine, it's just not what you want it to be, i.e. a meaningful proxy for a glyph's visual bounds.
{quote}
That's why CapHeight is used when the BBox isn't helpful.
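The fallback described above can be illustrated with a small sketch. This is not the PDFBox source: the class and method names are hypothetical, and the exact rule for deciding that the BBox "isn't helpful" (half the BBox height is zero, or larger than a positive CapHeight) is an assumption made for the example.

```java
/**
 * Hypothetical sketch, not PDFBox source: picks a glyph height for text
 * extraction from two font metrics given in glyph-space units.
 */
final class GlyphHeightSketch {

    private GlyphHeightSketch() { }

    /**
     * @param bboxHeight height of the font bounding box (0 if missing)
     * @param capHeight  CapHeight from the font descriptor (0 if missing)
     * @return the height to use for glyphs of this font
     */
    static float estimateGlyphHeight(float bboxHeight, float capHeight) {
        // Start from half of the font BBox height.
        float height = bboxHeight / 2f;

        // Assumption: the BBox is treated as unhelpful when its half-height
        // is zero or exceeds a usable (positive) CapHeight.
        boolean capHeightUsable = capHeight > 0f;
        if (capHeightUsable && (height == 0f || capHeight < height)) {
            height = capHeight;
        }
        return height;
    }

    public static void main(String[] args) {
        // BBox half-height is smaller than CapHeight: the BBox value is kept.
        System.out.println(estimateGlyphHeight(1000f, 700f));
        // Oversized BBox: CapHeight is used instead.
        System.out.println(estimateGlyphHeight(2400f, 700f));
    }
}
```

With a rule of this shape, a font whose BBox is far taller than its capitals no longer inflates the reported height, while fonts without a CapHeight entry keep the BBox-only behaviour.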
{quote}
So are we going to solve that problem or not?
{quote}
Don't know. The alternatives are:
- use BBox only: some files won't be extracted nicely
- use BBox + CapHeight: more files will be extracted nicely, but the code will have 9 extra lines you don't like
- calculate a new BBox from actual glyphs: will make the software slower, will delay the release, and may or may not be more reliable. (If a font subset has only non-capital glyphs, then half of the "real bbox" would be too small.)

> Text extraction and height different in 2.0
> -------------------------------------------
>
>                 Key: PDFBOX-3062
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3062
>             Project: PDFBox
>          Issue Type: Sub-task
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>             Fix For: 2.0.0
>
>         Attachments: 005021-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced-marked-1.png, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB-reduced.pdf, PDFBOX-3062-H6NIYQXHLPGD3GI6SNIYINRAZBCDHUCB.pdf, PDFBOX-3062-N2MOQ7YZICIYGTPLQJAWJ4HLN6CCEMHZ-reduced.pdf, garbled text 2.pdf
>
>
> AR:
> {code}
> WITH THE increasing complexity of optical modules,
> {code}
> 1.8:
> {code}
> WITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=20.114626 space=7.472 width=28.214272]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=3.3176804]I
> String[72.80568,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[78.893074,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> String[90.71916,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=6.0873947]T
> String[96.80656,386.16 fs=1.0 xscale=9.963 height=6.5955067 space=2.49075 width=7.1932907]H
> {code}
> 2.0:
> {code}
> W
> ITH THE increasing complexity of optical modules,
> String[39.6,399.6 fs=1.0 xscale=29.888 height=9.584274 space=7.472 width=28.209717]W
> String[69.488,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=3.3177567]I
> String[72.805756,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[78.891556,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> String[90.719315,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=6.0858]T
> String[96.805115,386.16 fs=1.0 xscale=9.963 height=3.194865 space=2.49075 width=7.1949615]H
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)