www-legal-discuss mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Training models for OpenNLP on the Universal Dependencies corpus
Date Wed, 10 May 2017 14:31:21 GMT
Hello all,

We already had this discussion for OntoNotes [1], and I would like to know
how things stand for the Universal Dependencies (UD) [2] corpus.

The OpenNLP project develops statistical natural language processing
software which needs to be trained in order to produce a model that can be
used to perform one of our supported tasks such as part-of-speech tagging
or lemmatization.

We would like to know whether it is possible to train models on data
included in UD, which is licensed under various Creative Commons
licenses (e.g. CC BY-NC 3.0/4.0, CC BY-SA 4.0, CC BY 4.0), the GPL, and
others, and then license the trained models under the Apache License 2.0
(AL 2.0).

If you go to [2], you can see a list of the data files and their licenses.

As far as we understand, those licenses don't explicitly disallow using the
content for training models, as is the case with the OntoNotes LDC

The models we would like to train on that data are:
- Part-of-Speech models (contain bigrams and a set of individual words of
the training text)
- Lemmatizer models (contain a set of individual words of the training text)
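
To make concrete what kind of material from the training text would end up
in such a model, here is a minimal sketch in Python. It is not OpenNLP code
and does not reflect OpenNLP's actual model format. It assumes the training
data is in the CoNLL-U format used by UD (ten tab-separated columns, with the
word form in the second column) and that "bigrams" means pairs of adjacent
word forms within a sentence. The function names and the sample sentence are
made up for illustration.

```python
from typing import Iterable, List, Set, Tuple


def read_conllu_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Return the word forms of each sentence in CoNLL-U formatted text."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Comment lines carry sentence-level metadata.
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if not cols[0].isdigit():
            continue
        current.append(cols[1])  # column 2 is the word form
    if current:
        sentences.append(current)
    return sentences


def extract_retained_items(
    sentences: List[List[str]],
) -> Tuple[Set[str], Set[Tuple[str, str]]]:
    """Collect the individual words and the adjacent-word bigrams.

    Assumption: this mirrors the kind of data described above, not the
    actual contents of an OpenNLP model file.
    """
    words: Set[str] = set()
    bigrams: Set[Tuple[str, str]] = set()
    for forms in sentences:
        words.update(forms)
        bigrams.update(zip(forms, forms[1:]))
    return words, bigrams


# Hypothetical one-sentence sample in CoNLL-U layout.
SAMPLE = (
    "# text = The dog barks.\n"
    "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
    "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
    "3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
)

sentences = read_conllu_sentences(SAMPLE.splitlines())
words, bigrams = extract_retained_items(sentences)
print(sorted(words))
print(sorted(bigrams))
```

Under these assumptions the model would hold isolated words and word pairs,
but not the running text or the annotation of the corpus as a whole.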


[2] http://universaldependencies.org/
