From: Benson Margulies
To: legal-discuss@apache.org
Date: Fri, 5 Nov 2010 09:10:49 -0400
Subject: Re: Fair-use data in svn

It has to be CNN, *and* Reuters, *and* NYT ... and then we start on
languages that aren't English, and then you see how we stay very, very
busy at my day job.

A model only works on data that you train it on. If you train it on
Wikinews, you get a classifier (or whatever) for ... Wikinews.

Sim has grasped the essential: using limited data, you can certainly
prove out an algorithm. But a school of minnows can't set out to
produce an open source competitor for, say, OpenCalais, unless they can
share real data, lots and lots of real data.

On Fri, Nov 5, 2010 at 8:43 AM, Ross Gardler wrote:
> Does it have to be CNN? If it is news you want, how about WikiNews?
>
> http://en.wikinews.org/wiki/Main_Page
>
> Ross
>
> Sent from my mobile device.
>
> On 5 Nov 2010, at 06:37, Benson Margulies wrote:
>
>> Folks,
>>
>> What I think we've established here is that a certain category of NLP
>> tasks can't really be undertaken at Apache in the usual way.
>> I'm not saying that this is the end of the world or that it's not
>> worthwhile to try to undertake them in some other way.
>>
>> The NLP research community has 'been there and done that' in terms of
>> trying to clear rights to corpora. It's not necessarily impossible in
>> all cases, but it's not by any means guaranteed to be possible when
>> you need it to be possible.
>>
>> It's an interesting limit, perhaps, on open source: as a commercial
>> enterprise, I use a spider and grab all the visible content of the
>> web, with no regard for copyright, and so long as I don't turn around
>> and publish that text, I have essentially no legal exposure. I can do
>> statistics on it, train models on it, etc. Perhaps a content
>> publisher, if they knew that I had used a large amount of their data,
>> would take issue and ask me to pay something, and then perhaps we'd
>> have a discussion of fair use, or perhaps we'd pay.
>>
>> For the immediate project I'm working on, I'll just push it to github
>> after making my own personal (or corporate) determination of the legal
>> risk of being accused of unfair use when a bag of web pages, in a
>> compressed tar file, is in a public source control repository. For the
>> proposed OpenNLP podling, this will put some boundaries on them, but
>> they might be happy to only check in code and 'cleared' corpora, and
>> leave it to their users to apply the code to more interesting corpora.
>>
>> --benson
>>
>>
>> On Fri, Nov 5, 2010 at 5:15 AM, Sim IJskes wrote:
>>> On 11/05/2010 09:56 AM, Jukka Zitting wrote:
>>>>
>>>> Hi,
>>>>
>>>> On Fri, Nov 5, 2010 at 10:07 AM, Sim IJskes wrote:
>>>>>
>>>>> Wouldn't data publicly accessible in jira be just another case of
>>>>> redistribution? And by this falling within the scope of copyright
>>>>> in many jurisdictions?
>>>>
>>>> Sure, but the "purpose and character" of a Jira attachment is much
>>>> more limited than that of an official Apache release.
>>>> Plus the need for explicitly documenting the licensing status is much
>>>> more relaxed. We have lots of non-licensed Jira attachments that (at
>>>> least to my layman mind) clearly fall within fair use for research
>>>> purposes.
>>>
>>> I'm a layman;
>>>
>>> Isn't the distinction here that we are not talking about an original
>>> contribution, made by the author, but about an artifact that is nothing
>>> more than an aggregation of publicly available material? In the
>>> jurisdiction I live under (The Netherlands), this will expose you to
>>> legal actions. If you want to know more, look at the
>>> 'Knipselkrant-arrest'.
>>>
>>> Gr. Sim
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: legal-discuss-unsubscribe@apache.org
>>> For additional commands, e-mail: legal-discuss-help@apache.org