www-legal-discuss mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: Training models for OpenNLP on the OntoNotes corpus
Date Tue, 07 Feb 2017 15:01:35 GMT
Thanks for your answer!

We would not distribute the content itself in any way. The training process
reduces the copyright-protected input material to n-grams (with a length of
at most 2). That output should not be copyright-protectable by the original
copyright holder, since we don't take anything out that is long enough to be
copyright protected.
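
To illustrate the kind of reduction I mean, here is a minimal Python sketch.
It is not the actual OpenNLP feature generation code; the function names and
the sample sentences are made up for illustration, and plain n-gram counts
stand in for the real features and weights.

```python
from collections import Counter


def extract_ngrams(tokens, max_n=2):
    """Return all contiguous token n-grams with 1 <= n <= max_n."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return ngrams


def build_feature_counts(sentences, max_n=2):
    """Count n-grams over whitespace-tokenized sentences (hypothetical helper)."""
    counts = Counter()
    for sentence in sentences:
        counts.update(extract_ngrams(sentence.split(), max_n))
    return counts


# Made-up stand-in for corpus text; no real corpus data is used here.
sample = ["the cat sat", "the cat ran"]
counts = build_feature_counts(sample)

# The longest unit kept from the input is two tokens.
longest = max(len(gram) for gram in counts)
```

What ends up stored is only the set of short n-grams with their counts; the
original sentences and their order are not kept.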

There was a case in the EU that might be relevant for this:
https://en.wikipedia.org/wiki/Infopaq_International_A/S_v_Danske_Dagblades_Forening

Jörn

On Mon, Feb 6, 2017 at 5:35 PM, Henri Yandell <bayard@apache.org> wrote:

> I don't believe this is acceptable.
>
> It's a non-commercial license that would restrict the uses of the
> subsequent Apache product.
>
> Note that the license would also need signing (i.e. it's not something we
> can use off the shelf).
>
> One approach would be to contact LDC to let them know of our interest in
> using the corpus, but make sure they understand that the output would be
> going into a product under the Apache 2.0 license and that they understand
> our concern.
>
> Hen
>
> On Fri, Feb 3, 2017 at 2:51 AM, Joern Kottmann <joern@apache.org> wrote:
>
>> Hello all,
>>
>> the Apache OpenNLP library is a machine-learning-based toolkit for the
>> processing of natural language text. It supports the most common NLP tasks,
>> such as tokenization, sentence segmentation, part-of-speech tagging, named
>> entity extraction, chunking and parsing.
>>
>> Many of the competing solutions offer pre-trained models on various data
>> sources to their users. We came to the conclusion that we have to do the
>> same to stay relevant.
>>
>> The corpora we would like to train on are usually copyright protected
>> or have a license which restricts their use.
>>
>> I would like to know what the opinion here on legal-discuss is on training
>> models based on the OntoNotes corpus [1]. Its license can be found here
>> [2].
>>
>> The training process does the following with the corpus as input:
>>
>> - Generates string-based features (e.g. about word shape, n-grams,
>> various combinations, etc.); those features do not contain longer parts of
>> the corpus text
>>
>> - Computes weights for those features based on the corpus
>>
>> The features and weights are stored together in what we call a model, and
>> we wish to distribute this model under AL 2.0 at Apache OpenNLP.
>>
>> Would it be ok to do that? Are there any concerns?
>>
>> Thanks,
>>
>> Jörn
>>
>>
>> [1] https://catalog.ldc.upenn.edu/LDC2013T19
>>
>> [2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
>>
>
>
