Does it have to be CNN? If it's news you want, how about WikiNews?

http://en.wikinews.org/wiki/Main_Page

Ross

Sent from my mobile device.

On 5 Nov 2010, at 06:37, Benson Margulies wrote:

> Folks,
>
> What I think we've established here is that a certain category of NLP
> tasks can't really be undertaken at Apache in the usual way. I'm not
> saying that this is the end of the world or that it's not worthwhile to
> try to undertake them in some other way.
>
> The NLP research community has 'been there and done that' in terms of
> trying to clear rights to corpora. It's not necessarily impossible in
> all cases, but it's not by any means guaranteed to be possible when
> you need it to be possible.
>
> It's an interesting limit, perhaps, on open source: as a commercial
> enterprise, I use a spider and grab all the visible content of the
> web, with no regard for copyright, and so long as I don't turn around
> and publish that text, I have essentially no legal exposure. I can do
> statistics on it, train models on it, etc. Perhaps a content
> publisher, if they knew that I had used a large amount of their data,
> would take issue and ask me to pay something, and then perhaps we'd
> have a discussion of fair use, or perhaps we'd pay.
>
> For the immediate project I'm working on, I'll just push it to github
> after making my own personal (or corporate) determination of the legal
> risk of being accused of unfair use when a bag of web pages, in a
> compressed tar file, is in a public source control repository. For the
> proposed OpenNLP podling, this will put some boundaries on them, but
> they might be happy to only check in code and 'cleared' corpora, and
> leave it to their users to apply the code to more interesting corpora.
>
> --benson
>
>
> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes wrote:
>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>>
>>> Hi,
>>>
>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes wrote:
>>>>
>>>> Wouldn't data publicly accessible in Jira be just another case of
>>>> redistribution? And by this falling within the scope of copyright
>>>> in many jurisdictions?
>>>
>>> Sure, but the "purpose and character" of a Jira attachment is much
>>> more limited than that of an official Apache release. Plus the need
>>> for explicitly documenting the licensing status is much more relaxed.
>>> We have lots of non-licensed Jira attachments that (at least to my
>>> layman mind) clearly fall within fair use for research purposes.
>>
>> I'm a layman;
>>
>> Isn't the distinction here that we are not talking about an original
>> contribution, made by the author, but about an artifact that is nothing more
>> than an aggregation of publicly available material? In the jurisdiction I live
>> under (The Netherlands), this will expose you to legal action. If you want
>> to know more, look at the 'Knipselkrant-arrest'.
>>
>> Gr. Sim

---------------------------------------------------------------------
To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
For additional commands, e-mail: legal-discuss-help@apache.org