Message-ID: <485A933B.102@apache.org>
Date: Thu, 19 Jun 2008 10:11:23 -0700
From: Doug Cutting
To: core-dev@hadoop.apache.org
Subject: Re: Gigablast.com search engine- 10BILLION PAGES!
References: <499802fc0806051220j39709729ue336405639eb0bb9@mail.gmail.com> <485A5E6B.8060908@cs.put.poznan.pl>

Ted Dunning wrote:
> One way that this sort of statement can come out of a marketing person's
> mouth is if you scan 10 billion pages, decide that 95% of them will never
> appear on any results list and only actually index 500 million.

The classic way to boost your count by an order of magnitude is to count
a page as "indexed" when you've only indexed an anchor to it, but have not
actually downloaded and indexed the content of the page.

Doug