From: Grant Ingersoll
To: general@lucene.apache.org
Subject: Open Relevance Infrastucture Request
Date: Tue, 26 May 2009 08:32:17 -0400
Message-Id: <07322D83-9F50-455F-8DD1-0774E5FB1981@apache.org>
References: <1E9B66F4-4532-4ED3-BE72-B5DDE0751484@apache.org>

FYI, I have sent the following message to infrastructure@a.o. If you
have access to that mailing list, then you can follow the conversation
there. Otherwise, I will report back on it here.

-Grant

Begin forwarded message:

> From: Grant Ingersoll
> Date: May 26, 2009 8:27:54 AM EDT
> To: Apache Infrastructure
> Subject: Crawling and Bandwidth
>
> Hi,
>
> Over in Lucene land, we are investigating starting a new project
> that would go out and acquire and re-distribute content from the web
> for use in scalability and relevance testing
> (http://wiki.apache.org/lucene-java/OpenRelevance). The content would
> consist of pages that we know are freely redistributable (Creative
> Commons and similar licenses that allow for distribution).
>
> Obviously, this is likely to have a bearing on ASF infrastructure,
> which is why I'm writing. The crawling aspect is likely to be
> discrete events lasting for a few days or a week (depending on
> bandwidth throttling) and is likely to happen a lot as we start up,
> but then will stabilize over time and be less frequent. We can
> likely handle this through our Lucene zone, but are not sure if it
> would be capable performance-wise.
>
> Disk space and download bandwidth, on the other hand, are likely to
> be more of a concern. We anticipate having several collections
> (web, mail, etc.) of varying sizes. Practically speaking, 50-100
> GB is likely the maximum size for a collection, but we probably
> would have other smaller collections ranging from 100s of MBs to a
> few gigs. Even so, people with really big pipes may be interested
> in larger collections. Typically, when others have done this kind
> of thing, they actually send out hard drives containing the data.
> We are not proposing that.
>
> We don't anticipate an overwhelming number of downloads (it's kind
> of a niche area), but we're also not sure how to even go about
> estimating. We're also not sure how this should work w/ the ASF
> mirroring system, if at all.
>
> Another option is to ask the board for funding for us to use
> Amazon. I don't particularly like this approach b/c it is not
> obvious to me how one would cap the cost.
>
> To sum up, this project (we haven't even made it an official project
> yet) is purely exploratory at this point. I'm writing because we
> wanted to get Infrastructure's input before foisting something on
> the ASF that _could_ be a burden.
>
> WDYT? What concerns are we not thinking about with regard to
> infrastructure? Where could we put this data and how can we
> efficiently distribute it without affecting others?
>
> Thanks,
> Grant Ingersoll