opennlp-dev mailing list archives

From William Colen <william.co...@gmail.com>
Subject Re: Releasing a Language Detection Model
Date Tue, 11 Jul 2017 02:35:50 GMT
Regarding lang detect, we will release one model covering more than 100
languages. Anyone will be able to reproduce the training or improve it
according to their needs. For example, one can reduce the corpus to work
only with Latin-script languages, and maybe that will work better in some
applications.
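The kind of corpus reduction mentioned above could be sketched roughly like
this (the one-sample-per-line "lang<TAB>text" format is an assumption for
illustration; real Leipzig data goes through OpenNLP's corpus converters):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch: keep only training lines whose language tag is in a whitelist.
// The "lang<TAB>text" one-sample-per-line format is assumed here for
// illustration; it is not the raw Leipzig layout.
public class CorpusFilter {

    public static List<String> keepLanguages(List<String> lines, Set<String> keep) {
        return lines.stream()
                .filter(line -> {
                    int tab = line.indexOf('\t');
                    // Drop malformed lines that have no language tag.
                    return tab > 0 && keep.contains(line.substring(0, tab));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "por\tEste é um exemplo.",
                "rus\tЭто пример.",
                "spa\tEste es un ejemplo.");
        // Keep only the Latin-script languages we care about.
        List<String> latin = keepLanguages(corpus, Set.of("por", "spa", "ita", "fra"));
        latin.forEach(System.out::println);
    }
}
```

The filtered file could then be fed back into the normal training run to get
a smaller, more focused model.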

Today we require that a model be used with at least the OpenNLP version
that built it. For example, if a model was created with OpenNLP 1.7.1 we
can run it with OpenNLP 1.8.0 but not with 1.6.0. We can keep it that way.
I don't see a reason to update the models every release, but retraining can
help testing (F1, accuracy, etc. shouldn't change between releases).

It is also not clear to me how the default models would work. The idea is
not bad, but making it work properly is hard. I don't think we should
handle this in the library anyway, at least not now.
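For reference, the "default model on the classpath" behavior being debated
could look roughly like the sketch below. ModelResolver and its resolution
order are hypothetical, not an existing OpenNLP API; the resulting stream
would still be handed to the usual model constructor:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of "classpath first, file system second" model resolution.
// ModelResolver is hypothetical; OpenNLP has no such API today.
public class ModelResolver {

    public static InputStream resolve(String name) throws IOException {
        // 1. Look for a model bundled on the classpath, e.g. from a
        //    models jar published to Maven Central.
        InputStream fromClasspath =
                ModelResolver.class.getClassLoader().getResourceAsStream(name);
        if (fromClasspath != null) {
            return fromClasspath;
        }
        // 2. Fall back to an explicit file on disk, as the CLI does today.
        Path file = Path.of(name);
        if (Files.exists(file)) {
            return Files.newInputStream(file);
        }
        throw new IOException("Model not found on classpath or disk: " + name);
    }
}
```

The verbosity concern below would apply here: the caller cannot easily tell
which of the two sources actually supplied the model, so any real design
would need to log or report the chosen source.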

2017-07-10 22:45 GMT-03:00 <druss@apache.org>:

> +1 for releasing models
>
> As for the rest, I'm not sure how I feel. Is there just one model for the
> Language Detector? I don't want this to become a versioning issue:
> langDect.bin version 1 goes with 1.8.1, but version 2 goes with 1.8.2. Can
> anyone download the Leipzig corpus? Being able to reproduce the model is
> very powerful, because if you have additional data you can add it to the
> Leipzig corpus to improve your model.
>
> I am not a big fan of default models, because it is frustrating as a user
> when unexpected things happen (like if you think you are telling it to use
> your model, but it uses the default). However, if the code is verbose
> enough, this is really not a valid concern. I would want to see the use
> case develop.
> Daniel
>
>
> > On Jul 10, 2017, at 8:58 PM, Aliaksandr Autayeu <aliaksandr@autayeu.com>
> wrote:
> >
> > Great idea!
> >
> > +1 for releasing models.
> >
> > +1 to publish models in jars on Maven Central. This is the fastest way
> > to get somebody started. Moreover, having an extensible mechanism for
> > others to do it on their own is really helpful. I did this with extJWNL
> > for packaging WordNet data files. It is also convenient for packaging
> > one's own custom dictionaries and providing them via repositories. It
> > reuses existing infrastructure for things like versioning and
> > distribution. Model metadata has to be thought through though. Oh, what
> > a mouthful...
> >
> > +1 for separate download ("no dependency manager" cases)
> >
> > +1 to publish data/scripts/provenance. The more reproducible it is, the
> > better.
> >
> > +1 for some mechanism of loading models from classpath.
> >
> > ~ +1 to maybe explore the classpath for a "default" model for API (code)
> > use cases, perhaps similarly to Dictionary.getDefaultResourceInstance()
> > from extJWNL. But this has to be well thought through, as design mistakes
> > here might release some demons from jar hell. I didn't face it, but I'm
> > not sure the extJWNL design is best, as I didn't do much research on
> > alternatives. And I'd think twice before adding model jars to the main
> > binary distribution.
> >
> > +1 to store only the model-building code in the SCM repo. I would not
> > bloat the SCM with binaries. Maven repositories, though not ideal, are
> > better for this than SCM (and there are specialized tools like JFrog).
> >
> > ~ -1 about changing the CLI to use models from the classpath. There was
> > no proposal, but my understanding is that it would be some sort of
> > classpath:// URL - please correct or clarify. I'd like to see the
> > proposal and use cases where it is more convenient than the current way
> > of just pointing to the file. Perhaps it depends. Our models are already
> > zips with manifests, and jars are zips too. Perhaps we could change the
> > model packaging layout to make it more "jar-like", or augment it with
> > metadata for finding default models on the classpath for the above cases
> > of distributing through Maven repositories and loading from code, while
> > leaving the CLI as is - even if your model is technically on the
> > classpath, in most cases you can point to a jar in the file system and
> > thus leave the CLI like it is now. It seems that dealing with the
> > classpath is more suitable (convenient, safer, customary, ...) for
> > developers fiddling with code than for users fiddling with the command
> > line.
> >
> > +1 for mirroring source corpora. The more reproducible things are, the
> > better. But costs (infrastructure) and licenses (this looks like
> > redistribution, which is not always allowed) might be an issue.
> >
> > I'd also propose to augment the model metadata with (optional)
> > information about the source corpora, provenance, as much reproduction
> > information as possible, etc., mostly for easier reproduction and
> > provenance tracking. In my experience I had trouble recalling what
> > y-d-u-en.bin was trained on, which revision of that corpus, which part
> > or subset, which language, and whether it also had other annotations
> > (and respective models) for connecting all the possible models from that
> > corpus (e.g. sent-tok-pos-chunk-...).
> >
> > Aliaksandr
> >
> > On 10 July 2017 at 17:41, Jeff Zemerick <jzemerick@apache.org> wrote:
> >
> >> +1 to an opennlp-models jar on Maven Central that contains the models.
> >> +1 to having the models available for download separately (if easily
> >> possible) for users who know what they want.
> >> +1 to having the training data shared somewhere with scripts to
> >> generate the models. It will help protect against losing data, as
> >> William mentioned. I don't think we should depend on others to reliably
> >> host the data. I'll volunteer to help script the model generation to
> >> run on a fleet of EC2 instances if it helps.
> >>
> >> If the user does not provide a model to use on the CLI, can the CLI
> >> tools look on the classpath for a model whose name matches the needed
> >> model (like en-ner-person.bin) and, if found, use it automatically?
> >>
> >> Jeff
> >>
> >>
> >>
> >> On Mon, Jul 10, 2017 at 5:06 PM, Chris Mattmann <mattmann@apache.org>
> >> wrote:
> >>
> >>> +1. In terms of releasing models, maybe an opennlp-models package, and
> >>> then using the Maven structure of src/main/resources/<package prefix dirs>/*.bin
> >>> for putting the models.
> >>>
> >>> Then using an assembly descriptor to compile the above into a *-bin.jar?
> >>>
> >>> Cheers,
> >>> Chris
> >>>
> >>>
> >>>
> >>>
> >>> On 7/10/17, 4:09 PM, "Joern Kottmann" <kottmann@gmail.com> wrote:
> >>>
> >>>    My opinion about this is that we should offer the model as a Maven
> >>>    dependency for users who just want to use it in their projects, and
> >>>    also offer models for download for people to quickly try out
> >>>    OpenNLP. If the models can be downloaded, a new user could very
> >>>    quickly test them via the command line.
> >>>
> >>>    I don't really have any thoughts yet on how we should organize it;
> >>>    it would probably be nice to have some place where we can share all
> >>>    the training data, and then have the scripts to produce the models
> >>>    checked in. It should be easy to retrain all the models when we do
> >>>    a major release.
> >>>
> >>>    If a corpus vanishes, we should drop support for it; it must be
> >>>    obsolete then.
> >>>
> >>>    Jörn
> >>>
> >>>    On Mon, Jul 10, 2017 at 8:50 PM, William Colen <colen@apache.org>
> >>> wrote:
> >>>> We need to address things such as sharing the evaluation results and
> >>>> how to reproduce the training.
> >>>>
> >>>> There are several possibilities for that, but there are points to
> >>>> consider:
> >>>>
> >>>> Will we store the model itself in an SCM repository, or only the code
> >>>> that can build it?
> >>>> Will we deploy the models to the Maven Central repository? That is
> >>>> good for people using the Java API but not for the command-line
> >>>> interface; should we change the CLI to handle models in the classpath?
> >>>> Should we keep a copy of the training corpus or always download it
> >>>> from the original provider? We can't guarantee that the corpus will
> >>>> be there forever, not only because its license may change, but simply
> >>>> because the provider may not keep the server up anymore.
> >>>>
> >>>> William
> >>>>
> >>>>
> >>>>
> >>>> 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottmann@gmail.com>:
> >>>>
> >>>>> Hello all,
> >>>>>
> >>>>> since Apache OpenNLP 1.8.1 we have a new language detection
> >>>>> component which, like all our components, has to be trained. I think
> >>>>> we should release a pre-built model for it trained on the Leipzig
> >>>>> corpus. This will allow the majority of our users to get started
> >>>>> very quickly with language detection without the need to figure out
> >>>>> how to train it.
> >>>>>
> >>>>> How should this project release models?
> >>>>>
> >>>>> Jörn
> >>>>>
> >>>
> >>>
> >>>
> >>>
> >>
>
>
