www-legal-discuss mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: Training models for OpenNLP on the OntoNotes corpus
Date Tue, 07 Feb 2017 15:01:35 GMT
Thanks for your answer!

We would not distribute the content itself in any way. The training process
reduces the copyright-protected input material to n-grams (with a length of
at most 2). That output should not be copyrightable by the original copyright
holder, since we don't extract anything that is long enough to be eligible
for copyright protection.
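
To illustrate the kind of reduction I mean, here is a minimal sketch in Python. It is not the actual OpenNLP feature generator: the function name, the whitespace tokenization and the plain counts (instead of trained weights) are assumptions made only for this example.

```python
from collections import Counter


def extract_ngram_features(text, max_n=2):
    """Count word n-grams of length 1..max_n found in the text.

    Hypothetical helper for illustration; not part of OpenNLP.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of n tokens over the text and count each n-gram.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


features = extract_ngram_features("the cat sat on the mat")
```

In this sketch only single words and adjacent word pairs are kept, together with their counts. No sequence of more than two tokens from the input appears in the result.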

There was a case in the EU that might be relevant for this:


On Mon, Feb 6, 2017 at 5:35 PM, Henri Yandell <bayard@apache.org> wrote:

> I don't believe this is acceptable.
> It's a non-commercial license that would restrict the uses of the
> subsequent Apache product.
> Note that the license would also need signing (i.e. it's not something we
> can use off the shelf).
> One approach would be to contact LDC to let them know of our interest in
> using the corpus, but make sure they understand that the output would be
> going into a product under the Apache 2.0 license and that they understand
> our concern.
> Hen
> On Fri, Feb 3, 2017 at 2:51 AM, Joern Kottmann <joern@apache.org> wrote:
>> Hello all,
>> the Apache OpenNLP library is a machine-learning-based toolkit for the
>> processing of natural language text. It supports the most common NLP tasks,
>> such as tokenization, sentence segmentation, part-of-speech tagging, named
>> entity extraction, chunking and parsing.
>> Many of the competing solutions offer pre-trained models on various data
>> sources to their users. We came to the conclusion that we have to do the
>> same to stay relevant.
>> The corpora we would like to train on are usually copyright protected
>> or have a license that restricts their use.
>> I would like to know the opinion here on legal-discuss about training
>> models on the OntoNotes corpus [1]. Its license can be found here
>> [2].
>> The training process does the following with the corpus as input:
>> - Generates string-based features (e.g. about word shape, n-grams,
>> various combinations, etc.); those features do not contain longer parts of
>> the corpus text
>> - Computes weights for those features based on the corpus
>> The features and weights are stored together in what we call a model, and
>> we wish to distribute this model under AL 2.0 at Apache OpenNLP.
>> Would it be ok to do that? Are there any concerns?
>> Thanks,
>> Jörn
>> [1] https://catalog.ldc.upenn.edu/LDC2013T19
>> [2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
