On 05-11-10 11:37, Benson Margulies wrote:
> Folks,
>
> What I think we've established here is that a certain category of NLP
> tasks can't really be undertaken at Apache in the usual way. I'm not
> saying that this is the end of the world or that it's not worthwhile to
> try to undertake them in some other way.
>
> The NLP research community has 'been there and done that' in terms of
> trying to clear rights to corpora. It's not necessarily impossible in
> all cases, but it's not by any means guaranteed to be possible when
> you need it to be possible.
>
> It's an interesting limit, perhaps, on open source: as a commercial
> enterprise, I use a spider and grab all the visible content of the
> web, with no regard for copyright, and so long as I don't turn around
> and publish that text, I have essentially no legal exposure. I can do
> statistics on it, train models on it, etc. Perhaps a content
> publisher, if they knew that I had used a large amount of their data,
> would take issue and ask me to pay something, and then perhaps we'd
> have a discussion of fair use, or perhaps we'd pay.
>
> For the immediate project I'm working on, I'll just push it to github
> after making my own personal (or corporate) determination of the legal
> risk of being accused of unfair use when a bag of web pages, in a
> compressed tar file, sits in a public source control repository. For the
> proposed OpenNLP podling, this will put some boundaries on them, but
> they might be happy to only check in code and 'cleared' corpora, and
> leave it to their users to apply the code to more interesting corpora.
You could scrape the URLs listed at:
http://wiki.creativecommons.org/Books
classify them manually, and put those into your dataset.
Or limit your crawler to
http://www.gutenberg.org/wiki/Main_Page
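
Roughly, something like the sketch below could pull the candidate links for
manual classification and apply a gutenberg.org domain filter. It's only a
rough illustration: it assumes requests and BeautifulSoup are installed, that
the wiki page layout stays as it is today, and the domain list is just an
example.

# Rough sketch, not an official crawler: collect links from the CC Books
# wiki page for manual review, plus a domain filter for gutenberg.org.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BOOKS_WIKI = "http://wiki.creativecommons.org/Books"
# Example allow-list; adjust to whatever sources you clear.
ALLOWED_DOMAINS = {"www.gutenberg.org", "gutenberg.org"}

def candidate_links(page_url):
    """Return absolute http(s) links found on the wiki page."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    links = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"])
        if urlparse(url).scheme in ("http", "https"):
            links.add(url)
    return sorted(links)

def in_allowed_domain(url):
    """Filter a crawler could apply before fetching a URL."""
    return urlparse(url).netloc in ALLOWED_DOMAINS

if __name__ == "__main__":
    for url in candidate_links(BOOKS_WIKI):
        # Print for manual classification; a real pipeline would also
        # record the license noted on the wiki next to each link.
        print(url, "(gutenberg)" if in_allowed_domain(url) else "")
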
Gr. Sim
---------------------------------------------------------------------
To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
For additional commands, e-mail: legal-discuss-help@apache.org