From: Ted Dunning
Date: Wed, 13 May 2009 11:48:43 -0700
Subject: Re: Open Relevance Project?
To: general@lucene.apache.org

Crawling a reference dataset requires essentially one-time bandwidth.

Also, it is possible to download, say, Wikipedia in a single go. Likewise, there are various web crawls that are available for research purposes (I think). See http://webascorpus.org/ for one example. These would be single downloads.

I don't entirely see the point of redoing the spidering.

On Wed, May 13, 2009 at 10:56 AM, Grant Ingersoll wrote:

> Good point, although you never know. We also will have some bandwidth reqs
> for crawling.

--
Ted Dunning, CTO
DeepDyve