opennlp-dev mailing list archives

From Chris Mattmann <>
Subject Re: Releasing a Language Detection Model
Date Tue, 11 Jul 2017 13:45:21 GMT
Sounds good to me…

On 7/11/17, 9:30 AM, "Joern Kottmann" <> wrote:

    Right, very good point. I also think that it is very important to be able
    to load a model in one line from the classpath.
    I propose we have the following setup:
    - One jar contains one or multiple model packages (that's the zip container)
    - A model name itself should be kind of unique, e.g. eng-ud-token.bin
    - A user loads the model via: new
    SentenceModel(getClass().getResource("eng-ud-sent.bin")) <- the stream
    then gets closed properly
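
A minimal sketch of the one-line classpath load described above, written against the JDK only so it runs without OpenNLP or a model jar on the classpath. The class `ClasspathModelLoading` and the helper `readResource` are hypothetical names for illustration, and `Object.class` merely stands in for a packaged model file. With OpenNLP available, the open stream would presumably be handed to `new SentenceModel(in)` instead of being read into a byte array.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;

final class ClasspathModelLoading {

    /**
     * Hypothetical helper: locates a resource relative to the anchor class,
     * reads it fully, and closes the stream via try-with-resources.
     */
    static byte[] readResource(Class<?> anchor, String name) {
        URL url = anchor.getResource(name);
        if (url == null) {
            throw new IllegalArgumentException("Resource not found on classpath: " + name);
        }
        try (InputStream in = url.openStream()) {
            // With OpenNLP on the classpath this is where the model would be built,
            // e.g. new SentenceModel(in); here the bytes are simply returned.
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in resource: a class file that is always present in the JDK.
        byte[] bytes = readResource(Object.class, "Object.class");
        System.out.println("Read " + bytes.length + " bytes");
    }
}
```

Failing fast on a missing resource keeps the error close to the wrong name, which matters once several model jars share one classpath.
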
    Let's take away three things from this discussion:
    1) Store the data in a place where the community can access it
    2) Offer models on our download page, similar to what is done today on
    the SourceForge page
    3) Release models packed inside a jar file via maven central
    On Tue, Jul 11, 2017 at 3:00 PM, Aliaksandr Autayeu
    <> wrote:
    > To clarify on models and jars.
    > Putting a model inside a jar might not be a good idea. I mean here things like
    > bla-bla.jar/en-sent.bin. Our models are already zipped, so they are "jars"
    > already in a sense. We're good. However, current packaging and metadata
    > might not be very classpath friendly.
    > The use case I have in mind is being able to add needed models as
    > dependencies and load them by writing a line of code. For this case having
    > all models in a root with the same name might not be very convenient. Same
    > goes for manifest. The name "" is quite generic and it's
    > not too far-fetched to see some clashes because some other lib also
    > manifests something. It might be better to allow for some flexibility and
    > to adhere to classpath conventions. For example, having manifests in
    > something like org/apache/opennlp/models/ Or
    > opennlp/tools/ And perhaps even allowing to reference a
    > model in the manifest, so the model can be put elsewhere. Just in case
    > there are several custom models of the same kind for different pipelines in
    > the same app. For example, processing queries with one pipeline - one set
    > of models - and processing documents with another pipeline - another set of
    > models. In this case allowing for different classpaths is needed.
    > Perhaps to illustrate my thinking, something like this (which still keeps a
    > lot of possibilities open):
    > en-sent.bin/opennlp/tools/sentdetect/ (perhaps contains
    > a line with something like model =
    > /opennlp/tools/sentdetect/model/sent.model)
    > en-sent.bin/opennlp/tools/sentdetect/model/sent.model
    > This allows including en-sent.bin as dependency. And then doing something
    > like
    > SentenceModel sdm = SentenceModel.getDefaultResourceModel(); // if we want
    > default models in this way. Seems verbose enough to allow for some safety
    > through explicitness. That's if we want any defaults at all.
    > Or something like:
    > SentenceModel sdm =
    > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/");
    > Or
    > SentenceModel sdm =
    > SentenceModel.getResourceModel("/opennlp/tools/sentdetect/model/sent.model");
    > Or more in-line with a current style:
    > SentenceModel sdm = new
    > SentenceModel("/opennlp/tools/sentdetect/model/sent.model"); // though here
    > we commit to interpreting String as classpath reference. That's why I'd
    > prefer more explicit method names.
    > Or leave dealing with resources to the users, leave current code intact and
    > provide only packaging and distribution:
    > SentenceModel sdm = new
    > SentenceModel(this.getClass().getResourceAsStream("/.../.../manifest or
    > model"));
    > And also add F1/accuracy to the model metadata (at least CV-based, for
    > example 10-fold) for quick reference or quick understanding of what that
    > model is capable of. Could be helpful for those with a bunch of models
    > around. And for others as well to have a better insight about the model in
    > question.
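
A rough sketch of the manifest indirection proposed above: a small file on the classpath carries a `model = ...` line, and the loader follows that reference to the actual model. The properties-file format, the class name `ManifestModelLocator`, and both method names are assumptions made for illustration; the manifest location is passed in by the caller because the thread does not settle on a name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class ManifestModelLocator {

    /** Reads the "model" key from a properties-style manifest (the format is an assumption). */
    static String modelPath(InputStream manifest) {
        Properties props = new Properties();
        try {
            props.load(manifest);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String path = props.getProperty("model");
        if (path == null || path.trim().isEmpty()) {
            throw new IllegalStateException("Manifest has no 'model' entry");
        }
        return path.trim();
    }

    /** Opens the model that the manifest at the given classpath location points to. */
    static InputStream openModel(ClassLoader loader, String manifestResource) {
        String path;
        try (InputStream in = loader.getResourceAsStream(manifestResource)) {
            if (in == null) {
                throw new IllegalArgumentException("Manifest not found: " + manifestResource);
            }
            path = modelPath(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // ClassLoader lookups take no leading slash, unlike Class.getResource.
        String resource = path.startsWith("/") ? path.substring(1) : path;
        InputStream model = loader.getResourceAsStream(resource);
        if (model == null) {
            throw new IllegalStateException("Model not found on classpath: " + path);
        }
        return model; // the caller is responsible for closing this stream
    }

    public static void main(String[] args) {
        // In-memory manifest so the sketch runs without any packaged model.
        byte[] manifest = "model = /opennlp/tools/sentdetect/model/sent.model\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(modelPath(new ByteArrayInputStream(manifest)));
    }
}
```

Keeping the reference in the manifest rather than hard-coding the model path is what would let two pipelines in the same application point at different models of the same kind.
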
    > On 11 July 2017 at 06:37, Chris Mattmann <> wrote:
    >> Hi,
    >> FWIW, I’ve seen CLI tools – lots in my day – that can load from the CLI and
    >> override an
    >> internal classpath dependency. This is for people in environments who want
    >> a sensible
    >> / delivered internal classpath default and the ability for run-time, non
    >> zipped up/messing
    >> with JAR file override. Think about people who are using OpenNLP in both
    >> Java/Python
    >> environments as an example.
    >> Cheers,
    >> Chris
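
The override behaviour described above could look roughly like this: use a file given at run time when there is one, otherwise fall back to the default bundled on the classpath. This is only a sketch under assumptions; `ModelSource` and `open` are hypothetical names, and `Object.class` stands in for a bundled model so the example runs on a plain JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ModelSource {

    /**
     * Opens the model from overridePath when one is given, otherwise from the
     * classpath resource resolved relative to the anchor class.
     */
    static InputStream open(Path overridePath, Class<?> anchor, String defaultResource) {
        if (overridePath != null) {
            try {
                return Files.newInputStream(overridePath);
            } catch (IOException e) {
                throw new UncheckedIOException("Cannot read override " + overridePath, e);
            }
        }
        InputStream in = anchor.getResourceAsStream(defaultResource);
        if (in == null) {
            throw new IllegalStateException("No bundled default: " + defaultResource);
        }
        return in;
    }

    public static void main(String[] args) {
        // A first command-line argument, if present, overrides the bundled default.
        Path override = args.length > 0 ? Paths.get(args[0]) : null;
        try (InputStream in = open(override, Object.class, "Object.class")) {
            System.out.println("Opened model stream, first byte: " + in.read());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An explicit override that cannot be read fails instead of silently falling back, so a mistyped path does not end up running the default model unnoticed.
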
    >> On 7/11/17, 3:25 AM, "Joern Kottmann" <> wrote:
    >>     I would not change the CLI to load models from jar files. I never used
    >>     or saw a command line tool that expects a file as an input and would
    >>     then also load it from inside a jar file. It will be hard to
    >>     communicate how that works precisely in the CLI usage texts and this
    >>     is not a feature anyone would expect to be there. The intention of the
    >>     CLI is to give users the ability to quickly test OpenNLP before they
    >>     integrate it into their software and to train and evaluate models.
    >>     Users who for some reason have a jar file with a model inside can just
    >>     write "unzip model.jar".
    >>     After all I think this is quite a bit of complexity we would need to
    >>     add for it and it will have very limited use.
    >>     The use case of publishing jar files is to make the models easily
    >>     available to people who have a build system with dependency
    >>     management, they won't have to download models manually, and when they
    >>     update OpenNLP they can also update the models with a version string
    >>     change.
    >>     For the command line "quick start" use case we should offer the models
    >>     on a download page as we do today. This page could list both the
    >>     download link and the maven dependency.
    >>     Jörn
    >>     On Mon, Jul 10, 2017 at 8:50 PM, William Colen <>
    >> wrote:
    >>     > We need to address things such as sharing the evaluation results and
    >> how to
    >>     > reproduce the training.
    >>     >
    >>     > There are several possibilities for that, but there are points to
    >> consider:
    >>     >
    >>     > Will we store the model itself in a SCM repository or only the code
    >> that
    >>     > can build it?
    >>     > Will we deploy the models to a Maven Central repository? It is good
    >> for
    >>     > people using the Java API but not for command line interface, should
    >> we
    >>     > change the CLI to handle models in the classpath?
    >>     > Should we keep a copy of the training corpus or always download from
    >> the
    >>     > original provider? We can't guarantee that the corpus will be there
    >>     > forever, not only because it changed license, but simply because the
    >>     > provider is not keeping the server up anymore.
    >>     >
    >>     > William
    >>     >
    >>     >
    >>     >
    >>     > 2017-07-10 14:52 GMT-03:00 Joern Kottmann <>:
    >>     >
    >>     >> Hello all,
    >>     >>
    >>     >> since Apache OpenNLP 1.8.1 we have a new language detection
    >> component
    >>     >> which like all our components has to be trained. I think we should
    >>     >> release a pre-built model for it trained on the Leipzig corpus. This
    >>     >> will allow the majority of our users to get started very quickly
    >> with
    >>     >> language detection without the need to figure out how to train
    >> it.
    >>     >>
    >>     >> How should this project release models?
    >>     >>
    >>     >> Jörn
    >>     >>
