opennlp-dev mailing list archives

From William Colen <william.co...@gmail.com>
Subject Re: Releasing a Language Detection Model
Date Tue, 11 Jul 2017 13:33:18 GMT
+1


2017-07-11 10:30 GMT-03:00 Joern Kottmann <kottmann@gmail.com>:

> Hello,
>
> right, very good point. I also think that it is very important to be
> able to load a model in one line from the classpath.
>
> I propose we have the following setup:
> - One jar contains one or multiple model packages (that's the zip container)
> - A model name itself should be kind of unique, e.g. eng-ud-token.bin
> - A user loads the model via: new
> SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
> then gets closed properly
>
>
> Let's take away three things from this discussion:
> 1) Store the data in a place where the community can access it
> 2) Offer models on our download page similar as it is done today on
> the SourceForge page
> 3) Release models packed inside a jar file via maven central
>
> Jörn
>
> On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
> <aliaksandr@autayeu.com> wrote:
> > To clarify on models and jars.
> >
> > Putting a model inside a jar might not be a good idea. I mean here
> > things like bla-bla.jar/en-sent.bin. Our models are already zipped, so
> > they are "jars" already in a sense. We're good. However, the current
> > packaging and metadata might not be very classpath-friendly.
> >
> > The use case I have in mind is being able to add the needed models as
> > dependencies and load them with a single line of code. For this case,
> > having all models in the root with the same name might not be very
> > convenient. The same goes for the manifest. The name
> > "manifest.properties" is quite generic, and it's not too far-fetched
> > to see clashes because some other lib also manifests something. It
> > might be better to allow for some flexibility and to adhere to
> > classpath conventions. For example, having manifests in something like
> > org/apache/opennlp/models/manifest.properties, or
> > opennlp/tools/manifest.properties. And perhaps even allowing the
> > manifest to reference a model, so the model can be put elsewhere, just
> > in case there are several custom models of the same kind for different
> > pipelines in the same app. For example, processing queries with one
> > pipeline (one set of models) and processing documents with another
> > pipeline (another set of models). In this case, allowing for different
> > classpaths is needed.
> >
> > Perhaps to illustrate my thinking, something like this (which still
> > keeps a lot of possibilities open):
> > en-sent.bin/opennlp/tools/sentdetect/manifest.properties (perhaps
> > contains a line with something like model =
> > /opennlp/tools/sentdetect/model/sent.model)
> > en-sent.bin/opennlp/tools/sentdetect/model/sent.model
> >
> > This allows including en-sent.bin as a dependency, and then doing
> > something like:
> > SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we
> > want default models in this way. Seems verbose enough to allow for
> > some safety through explicitness. That's if we want any defaults at
> > all.
> > Or something like:
> > SentenceModel sdm =
> > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/manifest.properties");
> > Or:
> > SentenceModel sdm =
> > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
> > Or, more in line with the current style:
> > SentenceModel sdm = new
> > SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though
> > here we commit to interpreting the String as a classpath reference.
> > That's why I'd prefer more explicit method names.
> > Or leave dealing with resources to the users, leave the current code
> > intact, and provide only packaging and distribution:
> > SentenceModel sdm = new
> > SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
> > model"));
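[Editor's illustration] The last option, where resource handling stays with the caller, could be sketched as follows. This is a sketch assuming opennlp-tools is on the classpath and a model exists at the illustrative path from the example above; try-with-resources supplies the "stream gets closed properly" guarantee:

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceModel;

public class UserManagedLoading {
    public static void main(String[] args) throws IOException {
        String path = "/opennlp/tools/sentdetect/model/sent.model";
        // try-with-resources closes the stream even if deserialization fails
        try (InputStream in = UserManagedLoading.class.getResourceAsStream(path)) {
            if (in == null) {
                throw new FileNotFoundException(path + " is not on the classpath");
            }
            SentenceModel sdm = new SentenceModel(in);
            System.out.println(sdm.getLanguage());
        }
    }
}
```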
> >
> >
> > And to add to the model metadata also F1/accuracy (at least CV-based,
> > for example 10-fold) for quick reference or a quick understanding of
> > what the model is capable of. This could be helpful for those with a
> > bunch of models around, and for others as well, to have better insight
> > into the model in question.
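[Editor's illustration] If the manifest were a standard java.util.Properties file at a package-qualified classpath location as sketched in the layout above, resolving the model location could look like this. The key name "model" and the paths are assumptions taken from the example, not an existing OpenNLP API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ManifestLookup {
    // Reads the hypothetical "model" entry from a manifest.properties
    // stream and returns the classpath location of the model it names.
    static String readModelPath(InputStream manifest) throws IOException {
        Properties props = new Properties();
        props.load(manifest);
        return props.getProperty("model");
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getResourceAsStream("/opennlp/tools/sentdetect/manifest.properties")
        String manifest = "model = /opennlp/tools/sentdetect/model/sent.model";
        String path = readModelPath(
            new ByteArrayInputStream(manifest.getBytes(StandardCharsets.ISO_8859_1)));
        System.out.println(path); // -> /opennlp/tools/sentdetect/model/sent.model
    }
}
```

Keeping the manifest under a package path avoids the root-level name clashes described above, and the indirection lets the model file itself live anywhere on the classpath.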
> >
> >
> >
> > On 11 July 2017 at 06:37, Chris Mattmann <mattmann@apache.org> wrote:
> >
> >> Hi,
> >>
> >> FWIW, I’ve seen CLI tools (lots in my day) that can load from the CLI
> >> to override an internal classpath dependency. This is for people in
> >> environments who want a sensible, delivered internal classpath
> >> default, plus the ability to override it at run time without zipping
> >> up or messing with the JAR file. Think about people who are using
> >> OpenNLP in both Java/Python environments as an example.
> >>
> >> Cheers,
> >> Chris
> >>
> >>
> >>
> >>
> >> On 7/11/17, 3:25 AM, "Joern Kottmann" <kottmann@gmail.com> wrote:
> >>
> >>     I would not change the CLI to load models from jar files. I have
> >>     never used or seen a command line tool that expects a file as
> >>     input and would then also load it from inside a jar file. It
> >>     would be hard to communicate precisely how that works in the CLI
> >>     usage texts, and this is not a feature anyone would expect to be
> >>     there. The intention of the CLI is to give users the ability to
> >>     quickly test OpenNLP before they integrate it into their
> >>     software, and to train and evaluate models.
> >>
> >>     Users who for some reason have a jar file with a model inside can
> >>     just write "unzip model.jar".
> >>
> >>     After all, I think this is quite a bit of complexity we would
> >>     need to add, and it would have very limited use.
> >>
> >>     The use case for publishing jar files is to make the models
> >>     easily available to people who have a build system with
> >>     dependency management: they won't have to download models
> >>     manually, and when they update OpenNLP they can also update the
> >>     models with a version string change.
> >>
> >>     For the command line "quick start" use case we should offer the
> >>     models on a download page as we do today. This page could list
> >>     both the download link and the Maven dependency.
> >>
> >>     Jörn
> >>
> >>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <colen@apache.org>
> >> wrote:
> >>     > We need to address things such as sharing the evaluation
> >>     > results and how to reproduce the training.
> >>     >
> >>     > There are several possibilities for that, but there are points
> >>     > to consider:
> >>     >
> >>     > Will we store the model itself in an SCM repository, or only
> >>     > the code that can build it?
> >>     > Will we deploy the models to the Maven Central repository? It
> >>     > is good for people using the Java API but not for the command
> >>     > line interface; should we change the CLI to handle models on
> >>     > the classpath?
> >>     > Should we keep a copy of the training data or always download
> >>     > it from the original provider? We can't guarantee that the
> >>     > corpus will be there forever, not only because it changed
> >>     > license, but simply because the provider is not keeping the
> >>     > server up anymore.
> >>     >
> >>     > William
> >>     >
> >>     >
> >>     >
> >>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <kottmann@gmail.com>:
> >>     >
> >>     >> Hello all,
> >>     >>
> >>     >> since Apache OpenNLP 1.8.1 we have a new language detection
> >>     >> component which, like all our components, has to be trained. I
> >>     >> think we should release a pre-built model for it trained on
> >>     >> the Leipzig corpus. This will allow the majority of our users
> >>     >> to get started very quickly with language detection without
> >>     >> needing to figure out how to train it.
> >>     >>
> >>     >> How should this project release models?
> >>     >>
> >>     >> Jörn
> >>     >>
> >>
> >>
> >>
> >>
>
