opennlp-dev mailing list archives

From Jason Baldridge
Subject Re: Could we please have a jar file for models and a maven dependency?
Date Mon, 16 Apr 2012 14:27:01 GMT
English models were trained on the Penn Treebank. You should consider those
all frozen -- any further model access and development will be done and
posted on the OpenNLP Models github project. Any help with getting truly
open data curated and annotated is most welcome!


On Mon, Apr 16, 2012 at 6:49 AM, Jeyendran Balakrishnan wrote:

> >> Well, we agree, but we cannot publish copyright-protected training data
> >> such as MUC 6/7, ACE, etc. That's why we currently mostly focus on sharing
> >> the code which is necessary to work with these data sets.
> >> And data sets which can be distributed in some way under a restrictive
> >> (not AL compatible) license are published in the github project.
> I agree about the issues concerning publishing of copyrighted training
> data.
> Will it be possible, for each trained model, to clearly state the source for
> the corresponding training dataset, along with the corresponding URL? Then
> individual users can follow the link to check if the associated training
> data set has special licensing requirements, and potentially make
> arrangements with copyright holders to get access to it. Currently,
> for most models listed in the models section at OpenNLP at SourceForge
> (, it is not clear what the
> original training data is, or where it may be found.
> For example, for English POS tagging and sentence detection, the middle
> column says things like "Trained on opennlp training data" and "Maxent
> model with tag dictionary", but does not indicate where the original
> training data may be found.
> >> What we have to do in the end is to start a community labeling project
> >> on texts which can be licensed under an Open Source license.
> >> We started to work on the tooling for the community labeling project,
> >> but are progressing very slowly, because we do not have enough resources
> >> to write all the tooling.
> >> I am using the existing stuff for work-related projects and am able
> >> to contribute bug fixes and improvements back.
> >> Jörn
> -----Original Message-----
> From: Jörn Kottmann
> Sent: Monday, April 16, 2012 4:27 AM
> Subject: Re: Could we please have a jar file for models and a maven dependency?
> On 04/16/2012 01:13 PM, Jeyendran Balakrishnan wrote:
> > The github project for distributing model files sounds like a great idea.
> >
> > It would also be very useful to get an authoritative list (with name,
> > description, and especially URL) of the training data files used to
> > generate each of the trained models.
> > Especially for models trained using OpenNLP training data, it is not
> > clear where the training data files are available.
> > By making the training data files available, OpenNLP can enable users
> > to augment them by adding their own training samples and retrain on
> > the augmented data set.
> > Retraining would help significantly either in improving accuracy in
> > different problem domains (e.g., blog articles compared to newspaper
> > articles, etc.) or covering corner cases missed by the original
> > training data. Having the original training data will help
> > immeasurably since it will be much more manageable for users to merely
> > add their own training samples, compared to generating and annotating
> > all the original training samples.
> >
> > Any thoughts on this?
> >

Jason Baldridge
Associate Professor, Department of Linguistics
The University of Texas at Austin
