From: Tim Allison
Date: Fri, 14 Jun 2019 07:56:20 -0400
Reply-To: dev@pdfbox.apache.org
Mailing-List: contact dev-help@pdfbox.apache.org; run by ezmlm
MIME-Version: 1.0
Subject: Re: Release 2.0.16 ?
To: dev@pdfbox.apache.org
Content-Type: multipart/alternative; boundary="000000000000612ed2058b47562d"

--000000000000612ed2058b47562d
Content-Type: text/plain; charset="UTF-8"

Y. Will rerun today.

On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr wrote:

> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
> resulted in a large number of files without extraction.
> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B)
>
> Tilman
>
> On 13.06.2019 at 14:36, Tim Allison wrote:
> > All,
> >
> > On a dev branch, I replaced Optimaize with a dev version of
> > OpenNLP's language detector, and I updated the common tokens list to
> > cover the 120 langs covered by a dev version of OpenNLP's language
> > model. I changed the min token length for common words to 3 (from 4),
> > and I'm now using 30k common tokens per lang rather than 20k.
> >
> > I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> > 2.0.16-SNAPSHOT, and the results are here:
> >
> > http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >
> > Are there any critical problems with the updates in the contents
> > comparison files? Any improvements?
> >
> > I notice that 'cmn' is the most common category for 'not much actual
> > text'...we may want to require a higher confidence in language
> > detection before reporting a detected language...
> >
> > Any and all recommendations are welcomed! Thank you!
> >
> > Cheers,
> >
> > Tim
> >
> > On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler wrote:
> >> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>
> >>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>
> >>>> I haven't had a chance to look yet...
> >>>
> >>> I did... It's not looking good. It's probably the change in the ToUnicode stream
> >>> parsing, I'll investigate this.
> >> I'm going to have a look
> >>
> >> Andreas
> >>> Tilman
> >>>
> >>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison wrote:
> >>>>> +1
> >>>>>
> >>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> looks like it's time for the next release. How about cutting 2.0.16 in about 2
> >>>>>> weeks from now?
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org

--000000000000612ed2058b47562d--