hadoop-common-dev mailing list archives

From "Arthur Rodrigues" <arthur.almeida.rodrig...@gmail.com>
Subject Re: Gigablast.com search engine- 10BILLION PAGES!
Date Thu, 05 Jun 2008 20:23:21 GMT
On Thu, Jun 5, 2008 at 4:20 PM, Dan Segel <dansegel@gmail.com> wrote:

> Our ultimate goal is to basically replicate gigablast.com search engine.
> They claim to have less than 500 servers that contain 10billion pages
> indexed, spidered and updated on a routine basis...  I am looking at
> featuring 500 million pages indexed per node, and have a total of 20 nodes.
> Each node will feature 2 quad core processors, 4TB (at raid 5) and 32 gb of
> ram.  I believe this can be done however how many searches per second do
> you think would be realistic in this instance?  We are looking at achieving
> 25+/- searches per second ultimately spread out over the 20 nodes... I could
> really use some advice with this one.
>    Thanks,
>    D. Segel

Hey Dan,

The number of searches you can serve per second depends on so many
factors. How do you intend to distribute the data among the nodes? Or will
each node contain all the indexed data and serve a whole request by itself?
Among other things, this figure also depends on what hash function is used
to index the pages, on the complexity of the ranking algorithms, and so on.
I guess the best configuration would be NOT to use RAID, but to do something
more similar to how Google does it. You can improve latency by parallelizing
each request among the servers. Give some more details...
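
To make the two ideas above concrete (spreading pages across nodes with a
hash function, and parallelizing each request among the servers), here is a
minimal Python sketch. It is only an illustration under stated assumptions:
the names node_for, Shard, build_shards and search_all are hypothetical, the
"nodes" are in-memory objects rather than real servers, CRC32 of the URL is
just one possible hash, and the ranking is a plain term count. It is not how
Gigablast or Google actually do it.

```python
import heapq
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 20  # node count taken from Dan's message


def node_for(url, num_nodes=NUM_NODES):
    """Pick the node that stores a page by hashing its URL (assumed scheme)."""
    return zlib.crc32(url.encode("utf-8")) % num_nodes


class Shard:
    """Toy in-memory inverted index standing in for one node."""

    def __init__(self):
        self.postings = {}  # term -> {url: number of occurrences}

    def add(self, url, text):
        for term in text.lower().split():
            counts = self.postings.setdefault(term, {})
            counts[url] = counts.get(url, 0) + 1

    def search(self, term, k):
        hits = self.postings.get(term.lower(), {})
        # Local top-k by raw term count; a real ranker is far more complex.
        return heapq.nlargest(k, ((score, url) for url, score in hits.items()))


def build_shards(pages, num_nodes=NUM_NODES):
    """Distribute pages (url -> text) across shards using node_for."""
    shards = [Shard() for _ in range(num_nodes)]
    for url, text in pages.items():
        shards[node_for(url, num_nodes)].add(url, text)
    return shards


def search_all(shards, term, k=10):
    """Query every shard in parallel, then merge the local top-k lists."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: shard.search(term, k), shards))
    merged = heapq.nlargest(k, (hit for part in partials for hit in part))
    return [(url, score) for score, url in merged]
```

With this layout each node holds only its own slice of the index, so every
query has to touch all nodes, and the slowest node sets the latency of the
whole request. If instead each node held all the indexed data, a query could
be served by a single node, at the cost of storing everything 20 times.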

