pdfbox-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: Release 2.0.16 ?
Date Fri, 14 Jun 2019 11:56:20 GMT
Yes. Will rerun today.

On Fri, Jun 14, 2019 at 12:09 AM Tilman Hausherr <THausherr@t-online.de>
wrote:

> Hi, can you run these again? The recently fixed regression in PDFBOX-4550
> had resulted in a large number of files without extraction
> (NUM_COMMON_TOKENS_A much larger than NUM_COMMON_TOKENS_B).
>
> Tilman
>
> On 13.06.2019 at 14:36, Tim Allison wrote:
> > All,
> >
> >    On a dev branch, I replaced Optimaize with a dev version of
> > OpenNLP's language detector, and I updated the common tokens list to
> > cover the 120 langs covered by a dev version of OpenNLP's language
> > model.  I changed the min token length for common words to 3 (from 4),
> > and I'm now using 30k common tokens per lang rather than 20k.
> >
> >    I reran this dev version of tika-eval on PDFBox 2.0.15 vs
> > 2.0.16-SNAPSHOT, and the results are here:
> >
> > http://162.242.228.174/reports/tika_eval_opennlp_reports.tgz
> >
> >    Are there any critical problems with the updates in the contents
> > comparison files?  Any improvements?
> >
> >    I notice that 'cmn' is the most common category for 'not much actual
> > text'...we may want to require a higher confidence in language
> > detection before reporting a detected language...
> >
> >    Any and all recommendations are welcomed!  Thank you!
> >
> >             Cheers,
> >
> >                         Tim
> >
> >
> >
> >
> > On Thu, Jun 13, 2019 at 12:54 AM Andreas Lehmkuehler <andreas@lehmi.de> wrote:
> >> On 12.06.19 at 21:08, Tilman Hausherr wrote:
> >>> On 12.06.2019 at 03:56, Tim Allison wrote:
> >>>> Reports are available here for 2.0.16-SNAPSHOT:
> >>>>
> >>>> http://162.242.228.174/reports/pdfbox_2_0_16-SNAPSHOT_reports.tgz
> >>>>
> >>>> I haven't had a chance to look yet...
> >>>
> >>> I did... It's not looking good. It's probably the change in the
> >>> ToUnicode stream parsing; I'll investigate this.
> >> I'm going to have a look
> >>
> >> Andreas
> >>> Tilman
> >>>
> >>>
> >>>
> >>>> On Sat, Jun 8, 2019 at 9:15 AM Tim Allison <tallison@apache.org> wrote:
> >>>>> +1
> >>>>>
> >>>>> On Sat, Jun 8, 2019 at 6:33 AM Andreas Lehmkuehler <andreas@lehmi.de> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> looks like it's time for the next release. How about cutting 2.0.16
> >>>>>> in about 2 weeks from now?
> >>>>>>
> >>>>>> WDYT?
> >>>>>>
> >>>>>> Andreas
> >>>>>>
> >>>>>>
> >>>>>> ---------------------------------------------------------------------
> >>>>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> >>>>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
