www-legal-discuss mailing list archives

From Chris Mattmann <mattm...@apache.org>
Subject Re: Training models for OpenNLP on the Universial Dependency corpus
Date Fri, 19 May 2017 14:19:28 GMT
Thanks Jorn. I hope to take this on and help drive it to a decision
promptly.

 

Cheers,

Chris

From: Joern Kottmann <kottmann@gmail.com>
Reply-To: "legal-discuss@apache.org" <legal-discuss@apache.org>
Date: Friday, May 19, 2017 at 4:13 AM
To: "legal-discuss@apache.org" <legal-discuss@apache.org>
Subject: Re: Training models for OpenNLP on the Universial Dependency corpus

 

Hello, 

 

I opened an issue for this:

https://issues.apache.org/jira/browse/LEGAL-309

 

We would like to get this resolved soon. The project released OpenNLP 1.8.0 yesterday,
and we would now like to release pre-trained models as well, but we need to resolve this
question before we can do that.

 

Jörn

 

On Wed, May 10, 2017 at 4:31 PM, Joern Kottmann <kottmann@gmail.com> wrote:

Hello all, 

 

We already had this discussion for OntoNotes [1], and I would like to know where things stand
for the Universal Dependencies (UD) [2] corpus.

 

The OpenNLP project develops statistical natural language processing software that needs
to be trained in order to produce a model, which can then be used to perform one of our
supported tasks, such as part-of-speech tagging or lemmatization.

 

We would like to know whether it is possible to train models on data included in UD, which
is itself licensed under various Creative Commons licenses (e.g. CC BY-NC 3.0/4.0, CC BY-SA
4.0, CC BY 4.0), the GPL and others, and then license the trained models under the Apache
License 2.0 (AL 2.0).

 

If you go to [2] you can see a list of the data files and their licenses.

 

As far as we understand, those licenses don't explicitly disallow using the content for training
models, as is the case with the OntoNotes LDC license.

 

The models we would like to train on that data are:

- Part-of-speech models (these contain bigrams and a set of individual words from the training text)

- Lemmatizer models (these contain a set of individual words from the training text)
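
To make concrete what "bigrams and a set of individual words" means for the question above, here is a minimal Python sketch. It is not OpenNLP code and does not reflect the actual model format; the function name and the toy sentences are hypothetical, and it assumes the bigrams are pairs of adjacent words within a sentence.

```python
def collect_surface_forms(sentences):
    """Collect the individual words and adjacent-word bigrams of a corpus.

    `sentences` is a list of token lists. This only illustrates which
    fragments of the training text would end up in a model under the
    assumptions stated above; it is not how OpenNLP builds its models.
    """
    words = set()
    bigrams = set()
    for tokens in sentences:
        # Every distinct token of the training text is kept once.
        words.update(tokens)
        # Pair each token with its successor inside the same sentence.
        bigrams.update(zip(tokens, tokens[1:]))
    return words, bigrams


# Toy training text (hypothetical, not taken from UD).
sentences = [["The", "dog", "barks"], ["The", "cat", "sleeps"]]
words, bigrams = collect_surface_forms(sentences)
```

In this sketch only isolated words and two-word fragments are retained, not whole sentences of the training text.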

 

Jörn

 

[1] http://mail-archives.apache.org/mod_mbox/www-legal-discuss/201702.mbox/%3CCA%2BV%3DWqhEsBWDb%2BQ%2BaEkjfO_FmGoPx2yGiw2oHYjQrWpaUGmoNw%40mail.gmail.com%3E

[2] http://universaldependencies.org/

 

 

