hadoop-common-dev mailing list archives

From: "Ted Dunning" <ted.dunn...@gmail.com>
Subject: Re: Gigablast.com search engine- 10BILLION PAGES!
Date: Thu, 19 Jun 2008 17:01:34 GMT
One way that this sort of statement can come out of a marketing person's
mouth is if you scan 10 billion pages, decide that 95% of them will never
appear on any results list and only actually index 500 million.  It could
also happen if you index 5% for real-time search and index the rest for
(very) slow search.  Either of these implementations could be described as
"indexing" 10 billion pages by somebody who doesn't understand any nuances
and who doesn't want to know the difference.  The rationalization would
proceed along the lines of statements like "so you are saying that 99
percent of the time the results are the same as if all the pages were in the
index?"

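To make that second scenario concrete, here is a rough sketch of what such a
tiered lookup might look like. The class names and the fallback threshold are
invented for illustration and have nothing to do with how Gigablast is
actually built:

    import java.util.ArrayList;
    import java.util.List;

    /**
     * Minimal sketch of a tiered setup: a small "fast" index serves most
     * queries, and a much larger "slow" index is consulted only when the
     * fast tier comes up short. Everything here is illustrative.
     */
    public class TieredSearcher {

        /** Placeholder for any index that can answer a query. */
        interface Index {
            List<String> search(String query, int limit);
        }

        private final Index fastTier;  // e.g. the ~5% indexed for real-time search
        private final Index slowTier;  // the remaining ~95%, indexed for (very) slow search
        private final int minResults;  // fall back to the slow tier below this count

        TieredSearcher(Index fastTier, Index slowTier, int minResults) {
            this.fastTier = fastTier;
            this.slowTier = slowTier;
            this.minResults = minResults;
        }

        List<String> search(String query, int limit) {
            List<String> results = new ArrayList<>(fastTier.search(query, limit));
            // Most queries never reach the slow tier, which is what lets the
            // operator claim "10 billion pages indexed" while serving nearly
            // all traffic from the small tier.
            if (results.size() < minResults) {
                results.addAll(slowTier.search(query, limit - results.size()));
            }
            return results;
        }
    }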
The paper that Dawid refers to has a fairly damning graph that seems to
indicate that Gigablast has nearly 2 orders of magnitude fewer pages indexed
compared to Yahoo for most queries and 3-4 orders of magnitude fewer in
some cases.

On Thu, Jun 19, 2008 at 6:26 AM, Dawid Weiss <dawid.weiss@cs.put.poznan.pl> wrote:

>> They claim to have less than 500 servers that contain 10 billion pages
>
> Such statements are not always supported by evidence. As a side effect of
> another experiment, we compared document-count estimates from Google, Yahoo,
> Live and Gigablast -- they seem to reflect the actual index proportions
> between these search engines.
>
> It's an internal tech report, so it may be rough around the edges, but even
> the illustrations should be pretty self-evident:
> http://www.cs.put.poznan.pl/dweiss/xml/publications/index.xml?lang=en&highlight=phrasals#phrasals
>
> Here is a direct PDF link:
> http://www.cs.put.poznan.pl/dweiss/site/publications/download/2008-weiss-chamielec.pdf
>
> Dawid
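
As a back-of-the-envelope illustration of the kind of comparison Dawid
describes: collect the result-count estimates each engine reports for the
same phrases and look at the ratios. The hit counts below are made up; the
real data is in the linked report.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /**
     * Rough sketch: per-phrase hit-count ratios between a reference engine
     * and an engine under test, expressed in orders of magnitude. The
     * phrases and counts are invented for illustration only.
     */
    public class IndexSizeRatios {
        public static void main(String[] args) {
            // phrase -> {reference engine hits, engine-under-test hits}
            Map<String, long[]> counts = new LinkedHashMap<>();
            counts.put("phrase A", new long[] {2_000_000L, 25_000L});
            counts.put("phrase B", new long[] {900_000L, 8_000L});
            counts.put("phrase C", new long[] {5_000_000L, 1_200L});

            for (Map.Entry<String, long[]> e : counts.entrySet()) {
                long ref = e.getValue()[0];
                long test = e.getValue()[1];
                double orders = Math.log10((double) ref / test);
                System.out.printf("%s: reference/test = %.0fx (~%.1f orders of magnitude)%n",
                        e.getKey(), (double) ref / test, orders);
            }
        }
    }

If the ratios are roughly stable across many phrases, they plausibly track
the relative index sizes, which is the effect the report's graphs show.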

