hadoop-common-dev mailing list archives

From Dawid Weiss <dawid.we...@cs.put.poznan.pl>
Subject Re: Gigablast.com search engine- 10BILLION PAGES!
Date Thu, 19 Jun 2008 17:26:07 GMT

> The paper that David refers to has a fairly damning graph that seems to
> indicate that Gigablast has nearly 2 orders of magnitude fewer pages indexed
> than Yahoo for most documents and 3-4 orders of magnitude fewer in
> some cases.

Yep, that was also my conclusion from that piece of research. I wouldn't go as
far as claiming an actual difference in index size (we have no idea how the
document count estimation algorithms work), but as far as we can tell, the
ordering by estimated index size looks like this:

- Yahoo
- Google (a bit less than Yahoo, but roughly the same)
- Live (1 order of magnitude less than the above)
- Gigablast (1 order of magnitude less than Live)

What interested me more than the raw counts was how consistently Yahoo and Live
each returned similar estimates of matching documents, regardless of the time
and the machine the query was issued from (we took 10 independent samples).
For Google, on the other hand, the results could vary by an order of magnitude
(!). Another interesting thing was the correlation between Yahoo's and Live's
results -- nearly perfect.
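
To make the two measurements above concrete, here is a minimal sketch of how one might compute them. Everything in it is an assumption for illustration: the function names are hypothetical, the numbers are made up (they are not figures from the study), and Pearson correlation on log10 counts is just one plausible choice, since the message does not say which measure was used.

```python
import math

def spread_in_orders_of_magnitude(samples):
    """Spread of repeated estimates for one query, as log10(max / min)."""
    return math.log10(max(samples) / min(samples))

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Made-up repeated estimates for a single query (NOT data from the study).
stable_samples = [1_000_000, 1_050_000, 990_000]
unstable_samples = [200_000, 2_000_000, 900_000]

# Made-up per-query estimates from two hypothetical engines; the second
# reports exactly one tenth of the first for every query.
engine_a = [1e6, 5e4, 3e7, 8e5]
engine_b = [1e5, 5e3, 3e6, 8e4]

stable_spread = spread_in_orders_of_magnitude(stable_samples)
unstable_spread = spread_in_orders_of_magnitude(unstable_samples)

# Correlate on a log scale so a constant factor between engines
# (e.g. "1 order of magnitude less") still counts as perfect agreement.
log_corr = pearson([math.log10(a) for a in engine_a],
                   [math.log10(b) for b in engine_b])
```

Working on a log scale matches the way the comparison is phrased here (orders of magnitude): two engines whose counts differ by a fixed factor still correlate perfectly, and a spread of 1.0 means the estimates for one query varied by a full order of magnitude.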

