www-legal-discuss mailing list archives

From Joern Kottmann <kottm...@gmail.com>
Subject Re: Training models for OpenNLP on the OntoNotes corpus
Date Thu, 09 Feb 2017 14:56:50 GMT
Hello,

right, I agree with you, let me ask them.

Thanks,
Jörn

On Wed, Feb 8, 2017 at 7:07 AM, Henri Yandell <bayard@apache.org> wrote:

> The license says:
>
>     "In the event that User's use of the LDC Databases results in the
> development of a commercial product, User must join...pay fees...".
>
> While I don't think LDC have necessarily considered Apache's use of their
> product, and the license text doesn't appear to be considering a situation
> where the two User definitions are different individuals (ie: Apache the
> first, our users the second); I don't think it's clear that LDC are in
> favour of our using their product and you should contact them to get
> clarification that we can use their product to develop an Apache 2.0
> licensed product which may subsequently be used in our users' commercial
> products.
>
> Hen
>
> On Tue, Feb 7, 2017 at 7:01 AM, Joern Kottmann <kottmann@gmail.com> wrote:
>
>> Thanks for your answer!
>>
>> We would not distribute the content itself in any way. The training
>> process will reduce the copyright-protected input material into n-grams
>> (which will have a length of at most 2). That work should not be
>> copyrightable by the original copyright holder, since we don't take
>> anything out that is long enough to be copyright protected.
>>
>> There was a case in the EU that might be relevant for this:
>> https://en.wikipedia.org/wiki/Infopaq_International_A/S_v_Danske_Dagblades_Forening
>>
>> Jörn
>>
>> On Mon, Feb 6, 2017 at 5:35 PM, Henri Yandell <bayard@apache.org> wrote:
>>
>>> I don't believe this is acceptable.
>>>
>>> It's a non-commercial license that would restrict the uses of the
>>> subsequent Apache product.
>>>
>>> Note that the license would also need signing (i.e. it's not something
>>> we can use off the shelf).
>>>
>>> One approach would be to contact LDC to let them know of our interest in
>>> using it, but make sure they understand that the output would be going into
>>> a product under the Apache 2.0 license and that they understand our concern.
>>>
>>> Hen
>>>
>>> On Fri, Feb 3, 2017 at 2:51 AM, Joern Kottmann <joern@apache.org> wrote:
>>>
>>>> Hello all,
>>>>
>>>> the Apache OpenNLP library is a machine learning based toolkit for the
>>>> processing of natural language text. It supports the most common NLP tasks,
>>>> such as tokenization, sentence segmentation, part-of-speech tagging, named
>>>> entity extraction, chunking and parsing.
>>>>
>>>> Many of the competing solutions offer pre-trained models on various
>>>> data sources to their users. We came to the conclusion that we have to do
>>>> the same to stay relevant.
>>>>
>>>> The corpora we would like to train on are usually copyright protected
>>>> or have a license that restricts their use.
>>>>
>>>> I would like to know what the opinion here on legal-discuss is on training
>>>> models based on the OntoNotes corpus [1]. Their license can be found here
>>>> [2].
>>>>
>>>> The training process does the following with the corpus as input:
>>>>
>>>> - Generates string-based features (e.g. about word shape, n-grams,
>>>> various combinations, etc.); those features do not contain longer parts of
>>>> the corpus text
>>>>
>>>> - Computes weights for those features based on the corpus
>>>>
>>>> The features and weights are stored together in what we call a model
>>>> and this model we wish to distribute under AL 2.0 at Apache OpenNLP.
>>>>
>>>> Would it be ok to do that? Are there any concerns?
>>>>
>>>> Thanks,
>>>>
>>>> Jörn
>>>>
>>>>
>>>> [1] https://catalog.ldc.upenn.edu/LDC2013T19
>>>>
>>>> [2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
>>>>
>>>
>>>
>>
>
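
For readers who want a concrete picture of the training process described in the first message of the thread (string-based features such as word shape and n-grams of at most length 2, plus weights computed from the corpus), here is a minimal sketch. It is not OpenNLP's actual code: the function names, the feature templates, and the use of a simple perceptron are all assumptions made for illustration, and the training sentence is invented toy data rather than anything from OntoNotes.

```python
def word_shape(token):
    """Map a token to a coarse shape, collapsing runs of the same class."""
    shape = []
    for ch in token:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "d"
        else:
            c = ch
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)


def extract_features(tokens, i):
    """Hypothetical feature templates: token, shape, and token bigrams."""
    tok = tokens[i].lower()
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"
    return [
        "w=" + tok,
        "shape=" + word_shape(tokens[i]),
        "bigram_prev=" + prev_tok + "_" + tok,
        "bigram_next=" + tok + "_" + next_tok,
    ]


def score(weights, feats, label):
    """Sum the weights of all (feature, label) pairs that fire."""
    return sum(weights.get((f, label), 0.0) for f in feats)


def train(sentences, epochs=5):
    """Perceptron training; the 'model' is just features plus weights."""
    weights = {}
    labels = sorted({label for _, tags in sentences for label in tags})
    for _ in range(epochs):
        for tokens, tags in sentences:
            for i, gold in enumerate(tags):
                feats = extract_features(tokens, i)
                pred = max(labels, key=lambda l: score(weights, feats, l))
                if pred != gold:
                    # Reward the correct label, penalize the wrong one.
                    for f in feats:
                        weights[(f, gold)] = weights.get((f, gold), 0.0) + 1.0
                        weights[(f, pred)] = weights.get((f, pred), 0.0) - 1.0
    return labels, weights


def predict(model, tokens):
    """Tag each token with its highest-scoring label."""
    labels, weights = model
    return [
        max(labels, key=lambda l: score(weights, extract_features(tokens, i), l))
        for i in range(len(tokens))
    ]


# Invented toy data, standing in for a licensed corpus.
toy_corpus = [(["Apache", "is", "great"], ["ORG", "O", "O"])]
model = train(toy_corpus)
```

In this sketch the distributed artifact would be only the `(feature, label) -> weight` table: no feature string spans more than two adjacent tokens of the input, which is the property the thread's argument relies on. Whether that property settles the licensing question is exactly what the thread leaves open.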
