www-legal-discuss mailing list archives

From Sim IJskes <sijs...@apache.org>
Subject Re: Fair-use data in svn
Date Fri, 05 Nov 2010 11:07:50 GMT
On 05-11-10 11:37, Benson Margulies wrote:
> Folks,
> What I think we've established here is that a certain category of NLP
> tasks can't really be undertaken at Apache in the usual way. I'm not
> saying that this is the end of the world or that it's not worthwhile to
> try to undertake them in some other way.
> The NLP research community has 'been there and done that' in terms of
> trying to clear rights to corpora. It's not necessarily impossible in
> all cases, but it's not by any means guaranteed to be possible when
> you need it to be possible.
> It's an interesting limit, perhaps, on open source: as a commercial
> enterprise, I use a spider and grab all the visible content of the
> web, with no regard for copyright, and so long as I don't turn around
> and publish that text, I have essentially no legal exposure. I can do
> statistics on it, train models on it, etc. Perhaps a content
> publisher, if they knew that I had used a large amount of their data,
> would take issue and ask me to pay something, and then perhaps we'd
> have a discussion of fair use, or perhaps we'd pay.
> For the immediate project I'm working on, I'll just push it to github
> after making my own personal (or corporate) determination of the legal
> risk of being accused of unfair use when a bag of web pages, in a
> compressed tar file, sits in a public source control repository. For the
> proposed OpenNLP podling, this will put some boundaries on them, but
> they might be happy to only check in code and 'cleared' corpora, and
> leave it to their users to apply the code to more interesting corpora.

You could scrape the URLs of:


And classify them manually, and put these into your dataset.

Or limit your crawler to


Gr. Sim
