hadoop-common-dev mailing list archives

From Doug Cutting <cutt...@apache.org>
Subject Re: Gigablast.com search engine- 10BILLION PAGES!
Date Thu, 19 Jun 2008 17:11:23 GMT
Ted Dunning wrote:
> One way that this sort of statement can come out of a marketing person's
> mouth is if you scan 10 billion pages, decide that 95% of them will never
> appear on any results list and only actually index 500 million.

The classic way to boost your count by an order of magnitude is to
count a page as "indexed" if you've only indexed an anchor to it, but
not actually downloaded and indexed the content of the page.
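
To make the difference concrete, here is a minimal sketch of the two counting rules. The `Page` record, its field names, the `index_counts` helper, and the toy numbers are all hypothetical, chosen only for illustration; they do not describe how any particular engine keeps its statistics.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical per-URL record; field names are assumptions."""
    url: str
    content_indexed: bool  # page was downloaded and its content indexed
    anchor_indexed: bool   # anchor text pointing at the page was indexed


def index_counts(pages):
    """Return (strict, inflated) counts of "indexed" pages.

    strict:   only pages whose content was actually fetched and indexed
    inflated: also counts pages known only through an indexed anchor
    """
    strict = sum(p.content_indexed for p in pages)
    inflated = sum(p.content_indexed or p.anchor_indexed for p in pages)
    return strict, inflated


# Toy data: one fetched page plus nine pages seen only as link targets.
pages = [Page("http://example.com/0", True, True)] + [
    Page(f"http://example.com/{i}", False, True) for i in range(1, 10)
]

strict, inflated = index_counts(pages)
print(f"strict={strict} inflated={inflated}")
```

With this made-up mix the inflated count is ten times the strict one, which is the order-of-magnitude gap described above.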

