From: Tilman Hausherr
To: dev@pdfbox.apache.org
Subject: Re: Release 2.0.15 ?
Date: Sat, 6 Apr 2019 17:19:09 +0200

I looked at about 10 files... all are rotated. I suspect this is a result of PDFBOX-4480, that previously some rotated words came as one. But this doesn't matter, the overall extraction of rotated pages would still look bad. For example, the file you mention extracted this in 2.0.14:

... R E R M H IV -1 infection hum an(B 8) [G oulder97c] ...

So it had "infection" but the rest was still worthless. The same file extracts nicely with the "rotationMagic" option of ExtractText.

Tilman

On 06.04.2019 at 15:50, Tim Allison wrote:
> http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz
>
> This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
> there were no content differences btwn 2.0.13 and 2.0.14. I did not
> apply angle detection.
>
> No new exceptions; 2 fixed exceptions. We're getting higher page
> counts in a few documents, because we overrode processPages() to
> process. Some changes in content, but overall, better, I think, based
> on contents/common_token_comparisons_by_mime.xlsx.
>
> To see where content appears to degrade, open
> contents/content_diffs_(no|with)_exceptions, and sort column M
> ('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order. Also, look at
> columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
> (TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
> frequent tokens that are unique to A or unique to B; from this, it
> looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
> but, generally (hand waving), it appears that there were word
> segmentation problems in both A and B as I look at the results.
>
> Cheers,
>
> Tim
>
> On Fri, Apr 5, 2019 at 10:53 AM Tim Allison wrote:
>> +1 I should have regression results by tomorrow
>>
>> On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun wrote:
>>> +1
>>>
>>>> On 05.04.2019 at 06:31, Andreas Lehmkuehler wrote:
>>>>
>>>> Hi,
>>>>
>>>> looks like it's time for the next release. How about cutting 2.0.15 next Monday?
>>>>
>>>> WDYT?
>>>>
>>>> Andreas
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
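
The "top 10 most frequent tokens that are unique to A or unique to B" comparison described in the quoted report can be illustrated roughly as follows. This is only a minimal sketch, assuming plain whitespace tokenization and lowercasing; the function name `top_unique_tokens` and the sample strings are hypothetical and are not taken from the actual report tooling, which may tokenize and count differently.

```python
from collections import Counter


def top_unique_tokens(text_a, text_b, n=10):
    """Return the n most frequent tokens found only in A, and only in B.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Keep only tokens that never occur in the other extraction.
    unique_a = Counter({t: c for t, c in counts_a.items() if t not in counts_b})
    unique_b = Counter({t: c for t, c in counts_b.items() if t not in counts_a})
    return unique_a.most_common(n), unique_b.most_common(n)


# Hypothetical extractions of the same page by two versions.
extract_a = "hiv infection infection human"
extract_b = "h iv infection hum an hum"
only_a, only_b = top_unique_tokens(extract_a, extract_b)
```

Broken word segmentation shows up here as short fragments ("hum", "an") that are unique to one side, which is the kind of signal the unique-token columns are meant to surface.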