lucene-general mailing list archives

From Grant Ingersoll <>
Subject Open Relevance Infrastructure Request
Date Tue, 26 May 2009 12:32:17 GMT
FYI, I have sent the following message to infrastructure@a.o.  If you  
have access to that mailing list, then you can follow the conversation  
there.  Otherwise, I will report back on it here.


Begin forwarded message:

> From: Grant Ingersoll <>
> Date: May 26, 2009 8:27:54 AM EDT
> To: Apache Infrastructure <>
> Subject: Crawling and Bandwidth
> Hi,
> Over in Lucene land, we are investigating starting a new project
> that would go out and acquire and re-distribute content from the web
> for use in scalability and relevance testing.  The content would
> consist of pages that we know are freely re-distributable (Creative
> Commons and similar licenses that allow distribution).
> Obviously, this is likely to have a bearing on ASF infrastructure,
> which is why I'm writing.  The crawling aspect is likely to consist
> of discrete events lasting a few days or a week (depending on
> bandwidth throttling).  It is likely to happen often as we start up,
> but will then stabilize and become less frequent.  We can likely
> handle this through our Lucene zone, but we are not sure whether the
> zone would be capable performance-wise.
> Disk space and download bandwidth, on the other hand, are likely to  
> be more of a concern.  We anticipate having several collections  
> (web, mail, etc.), of varying sizes.  Practically speaking, 50-100  
> GB is likely the maximum size for a collection, but we probably  
> would have other smaller collections ranging from 100s of MBs to a  
> few gigs.  Even so, people with really big pipes may be interested  
> in larger collections.  Typically, when others have done this kind  
> of thing, they actually send out hard drives containing the data.   
> We are not proposing that.
> We don't anticipate an overwhelming number of downloads (it's kind
> of a niche area), but we're also not sure how to even go about
> estimating the number.  We're also not sure how this should work w/
> the ASF mirroring system, if at all.
> Another option is to ask the board for funding for us to use  
> Amazon.  I don't particularly like this approach b/c it is not  
> obvious to me how one would cap the cost.
> To sum up, this project (we haven't even made it an official project  
> yet) is purely exploratory at this point.  I'm writing because we  
> wanted to get Infrastructure's input before foisting something on  
> the ASF that _could_ be a burden.
> WDYT?  What concerns are we not thinking about with regard to
> infrastructure?  Where could we put this data and how can we
> efficiently distribute it without affecting others?
> Thanks,
> Grant Ingersoll
