hadoop-common-user mailing list archives

From "Ted Dunning" <ted.dunn...@gmail.com>
Subject Re: Gigablast.com search engine- 10BILLION PAGES!
Date Thu, 05 Jun 2008 23:34:48 GMT
Web-scale and web-speed search almost always means memory-based search.

500 Mpages in 25 GB of memory means that you have about 50 bytes per document
available.  That is very small: conceivable for some applications, but not
likely to be enough if you want high-quality search.
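
For concreteness, the arithmetic behind that figure can be checked with a
quick back-of-envelope sketch (the 25 GB and 500 Mpages numbers are the ones
used above; the class and variable names are just illustrative):

    public class PerDocumentBudget {
        public static void main(String[] args) {
            long usableMemoryBytes = 25L * 1_000_000_000L; // ~25 GB of memory per node
            long pagesPerNode = 500_000_000L;              // 500 Mpages per node
            double bytesPerDoc = (double) usableMemoryBytes / pagesPerNode;
            System.out.printf("~%.0f bytes per document%n", bytesPerDoc); // prints ~50
        }
    }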

25 queries per second against an index of that size (measured by memory
footprint, not document count) seems very doable, possibly even easy.  You
should be able to do this with something like SOLR.
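
In case it helps, a minimal query through SolrJ looks something like the
sketch below.  This uses the current SolrJ client API rather than what SOLR
ships today, and the URL, collection name, and field name are placeholders,
not anything from this thread:

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SolrQueryExample {
        public static void main(String[] args) throws SolrServerException, IOException {
            // Placeholder URL and collection name -- adjust for your own deployment.
            SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/pages").build();
            SolrQuery query = new SolrQuery("content:hadoop"); // placeholder field and term
            query.setRows(10);                                 // return the top 10 hits
            QueryResponse response = client.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());
            client.close();
        }
    }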

I think you need to budget no more than 100Mpages per node (and that might
be ambitious).
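
Plugging that ceiling into the 10 billion page target quoted below gives a
rough sense of the cluster size involved (a sketch only; the numbers are the
ones from this thread):

    public class ClusterSizing {
        public static void main(String[] args) {
            long targetPages = 10_000_000_000L; // the 10 billion page target
            long pagesPerNode = 100_000_000L;   // suggested ceiling of 100 Mpages per node
            System.out.println("Nodes needed: " + targetPages / pagesPerNode); // prints 100
        }
    }

That works out to roughly 100 nodes, against the 20 proposed below.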

On Thu, Jun 5, 2008 at 1:27 PM, Dan Segel <sales@glowmania.net> wrote:

> Our ultimate goal is to basically replicate the gigablast.com search engine.
> They claim to have fewer than 500 servers that contain 10 billion pages
> indexed, spidered, and updated on a routine basis.  I am looking at
> featuring 500 million pages indexed per node, for a total of 20 nodes.
> Each node will feature two quad-core processors, 4 TB of storage (RAID 5),
> and 32 GB of RAM.  I believe this can be done; however, how many searches
> per second do you think would be realistic in this instance?  We are looking
> at achieving 25 +/- searches per second, ultimately spread out over the 20
> nodes.  I could really use some advice with this one.
>
> Thanks,
> D. Segel




-- 
ted
