www-legal-discuss mailing list archives

From Benson Margulies <bimargul...@gmail.com>
Subject Re: Fair-use data in svn
Date Fri, 05 Nov 2010 13:10:49 GMT
It has to be CNN, *and* Reuters, *and* NYT ... and then we start on
languages that aren't English, and then you see how we stay very,
very busy at my day job.

A model only works on data that you train it on. If you train it on
Wikinews, you get a classifier (or whatever) for ... Wikinews. Sim has
grasped the essential point: using limited data, you can certainly prove out
an algorithm. But a school of minnows can't set out to produce an open
source competitor for, say, OpenCalais, unless they can share real
data, lots and lots of real data.


On Fri, Nov 5, 2010 at 8:43 AM, Ross Gardler <rgardler@apache.org> wrote:
> Does it have to be CNN? If it is news you want, how about Wikinews?
>
> http://en.wikinews.org/wiki/Main_Page
>
> Ross
>
> Sent from my mobile device.
>
> On 5 Nov 2010, at 06:37, Benson Margulies <bimargulies@gmail.com> wrote:
>
>> Folks,
>>
>> What I think we've established here is that a certain category of NLP
>> tasks can't really be undertaken at Apache in the usual way. I'm not
>> saying that this is the end of the world or that it's not worthwhile to
>> try to undertake them in some other way.
>>
>> The NLP research community has 'been there and done that' in terms of
>> trying to clear rights to corpora. It's not necessarily impossible in
>> all cases, but it's not by any means guaranteed to be possible when
>> you need it to be possible.
>>
>> It's an interesting limit, perhaps, on open source: as a commercial
>> enterprise, I use a spider and grab all the visible content of the
>> web, with no regard for copyright, and so long as I don't turn around
>> and publish that text, I have essentially no legal exposure. I can do
>> statistics on it, train models on it, etc. Perhaps a content
>> publisher, if they knew that I had used a large amount of their data,
>> would take issue and ask me to pay something, and then perhaps we'd
>> have a discussion of fair use, or perhaps we'd pay.
>>
>> For the immediate project I'm working on, I'll just push it to GitHub
>> after making my own personal (or corporate) determination of the legal
>> risk of being accused of unfair use when a bag of web pages, in a
>> compressed tar file, is in a public source control repository. For the
>> proposed OpenNLP podling, this will put some boundaries on them, but
>> they might be happy to only check in code and 'cleared' corpora, and
>> leave it to their users to apply the code to more interesting corpora.
>>
>> --benson
>>
>>
>> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes <sijskes@apache.org> wrote:
>>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes <sijskes@apache.org> wrote:
>>>>>
>>>>> Wouldn't data publicly accessible in Jira be just another case of
>>>>> redistribution? And wouldn't it thereby fall within the scope of
>>>>> copyright in many jurisdictions?
>>>>
>>>> Sure, but the "purpose and character" of a Jira attachment is much
>>>> more limited than that of an official Apache release. Plus the need
>>>> for explicitly documenting the licensing status is much more relaxed.
>>>> We have lots of non-licensed Jira attachments that (at least to my
>>>> layman mind) clearly fall within fair use for research purposes.
>>>
>>> I'm a layman;
>>>
>>> Isn't the distinction here that we are not talking about an original
>>> contribution, made by the author, but about an artifact that is nothing more
>>> than an aggregation of publicly available material? In the jurisdiction I live
>>> under (The Netherlands), this will expose you to legal action. If you want
>>> to know more, look at the 'Knipselkrant-arrest'.
>>>
>>> Gr. Sim
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
>>> For additional commands, e-mail: legal-discuss-help@apache.org
>>>
>>>
>>
>
