www-legal-discuss mailing list archives

From Peter Kluegl <pklu...@gmail.com>
Subject Re: Training models for OpenNLP on the OntoNotes corpus
Date Fri, 17 Feb 2017 08:06:24 GMT
Hi Joern,


can you share the answer if you get one? I'd really appreciate it :-)


Best,


Peter


On Thu, Feb 9, 2017 at 3:56 PM, Joern Kottmann wrote:
> Hello,
>
> right, I agree with you, let me ask them.
>
> Thanks,
> Jörn
>
> On Wed, Feb 8, 2017 at 7:07 AM, Henri Yandell <bayard@apache.org> wrote:
>
>     The license says:
>
>         "In the event that User's use of the LDC Databases results in
>     the development of a commercial product, User must join...pay
>     fees...".
>
>     While I don't think LDC have necessarily considered Apache's use
>     of their product, and the license text doesn't appear to
>     consider a situation where the two User definitions are
>     different parties (i.e. Apache the first, our users the
>     second), it isn't clear that LDC are in favour of our using
>     their product. You should contact them to get clarification
>     that we can use their product to develop an Apache 2.0
>     licensed product which may subsequently be used in our users'
>     commercial products.
>
>     Hen
>
>     On Tue, Feb 7, 2017 at 7:01 AM, Joern Kottmann <kottmann@gmail.com> wrote:
>
>         Thanks for your answer!
>
>         We would not distribute the content itself in any way. The
>         training process will reduce the copyright-protected input
>         material into n-grams (which will have a length of at most 2).
>         That work should not be copyright-protectable by the original
>         copyright holder, since we don't take anything out that is
>         long enough to be copyright protected.
>
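>         Just to illustrate the point (this is only a sketch, not the
>         actual OpenNLP feature generation code), reducing a tokenized
>         sentence to bigrams keeps at most two consecutive tokens of
>         the original text:
>
>         import java.util.ArrayList;
>         import java.util.List;
>
>         // Illustrative sketch only: collect token bigrams, so no
>         // fragment longer than two consecutive corpus tokens is kept.
>         public class BigramSketch {
>             public static List<String> bigrams(String[] tokens) {
>                 List<String> result = new ArrayList<>();
>                 for (int i = 0; i < tokens.length - 1; i++) {
>                     result.add(tokens[i] + " " + tokens[i + 1]);
>                 }
>                 return result;
>             }
>         }
>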
>         There was a case in the EU that might be relevant for this:
>         https://en.wikipedia.org/wiki/Infopaq_International_A/S_v_Danske_Dagblades_Forening
>
>         Jörn
>
>         On Mon, Feb 6, 2017 at 5:35 PM, Henri Yandell <bayard@apache.org> wrote:
>
>             I don't believe this is acceptable.
>
>             It's a non-commercial license that would restrict the uses
>             of the subsequent Apache product.
>
>             Note that the license would also need signing (i.e. it's
>             not something we can use off the shelf).
>
>             One approach would be to contact LDC to let them know of
>             our interest in using their corpus, making sure they
>             understand that the output would be going into a product
>             under the Apache 2.0 license and that they understand
>             our concern.
>
>             Hen
>
>             On Fri, Feb 3, 2017 at 2:51 AM, Joern Kottmann <joern@apache.org> wrote:
>
>                 Hello all,
>
>                 The Apache OpenNLP library is a machine learning
>                 based toolkit for the processing of natural language
>                 text. It supports the most common NLP tasks, such as
>                 tokenization, sentence segmentation, part-of-speech
>                 tagging, named entity extraction, chunking and parsing.
>
>                 Many of the competing solutions offer pre-trained
>                 models on various data sources to their users. We came
>                 to the conclusion that we have to do the same to stay
>                 relevant.
>
>                 The corpora we would like to train on are usually
>                 copyright protected or have a license which restricts
>                 their use.
>
>                 I would like to know what the opinion here on
>                 legal-discuss is on training models based on the
>                 OntoNotes corpus [1]. Their license can be found here [2].
>
>                 The training process does the following with the
>                 corpus as input:
>
>                 - Generates string-based features (e.g. about word
>                 shape, n-grams, various combinations, etc.); these
>                 features do not contain longer parts of the corpus text
>
>                 - Computes weights for those features based on the corpus
>
>                 The features and weights are stored together in what
>                 we call a model, and we wish to distribute this model
>                 under AL 2.0 at Apache OpenNLP.
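>
>                 As a rough sketch of what such a model amounts to
>                 (this is not OpenNLP's actual model format or API,
>                 just an illustration), it is essentially a mapping
>                 from feature strings to learned numeric weights; the
>                 corpus text itself is not part of it:
>
>                 import java.util.HashMap;
>                 import java.util.Map;
>
>                 // Illustrative sketch only: a "model" maps feature
>                 // strings (word shape, bigrams, ...) to weights
>                 // learned during training on the corpus.
>                 public class ModelSketch {
>                     private final Map<String, Double> weights = new HashMap<>();
>
>                     public void setWeight(String feature, double weight) {
>                         weights.put(feature, weight);
>                     }
>
>                     // Score is the sum of the weights of the features
>                     // that are active for the current context.
>                     public double score(Iterable<String> activeFeatures) {
>                         double sum = 0.0;
>                         for (String f : activeFeatures) {
>                             sum += weights.getOrDefault(f, 0.0);
>                         }
>                         return sum;
>                     }
>                 }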
>
>                 Would it be ok to do that? Are there any concerns?
>
>                 Thanks,
>
>                 Jörn
>
>
>                 [1] https://catalog.ldc.upenn.edu/LDC2013T19
>
>                 [2] https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf
>

