hadoop-common-user mailing list archives

From Eric Baldeschwieler <eri...@yahoo-inc.com>
Subject Re: Yahoo's production webmap is now on Hadoop
Date Wed, 20 Feb 2008 01:33:04 GMT
Search engine index size comparison is actually a very inexact
science.  Various 3rd parties comparing the major search engines do
not come to the same conclusions.  But ours is certainly world class
and well over the discussed sizes.

Here is an interesting bit of web history...  A blog post from August 8,
2005 discussing our index of over 19.2 billion web documents.  It has
only grown since then.

http://www.ysearchblog.com/archives/000172.html


On Feb 19, 2008, at 2:38 PM, Ted Dunning wrote:

>
>
> Sorry to be picky about the math, but 1 Trillion = 10^12 = million million.
> At 10 links per page, this gives 100 x 10^9 pages, not 1 x 10^9.
> At 100 links per page, this gives 10B pages.
>
>
> On 2/19/08 2:25 PM, "Peter W." <peter@marketingbrokers.com> wrote:
>
>> Amazing milestone,
>>
>> Looks like Y! had approximately 1B documents in the WebMap:
>>
>> one trillion links=(10k million links/10 links per page)=1000 million
>> pages=one billion.
>>
>> If Google has 10B docs (indexed w/25 MR jobs) then Hadoop has
>> achieved one-tenth of its scale?
>>
>> Good stuff,
>>
>> Peter W.
>>
>>
>>
>>
>> On Feb 19, 2008, at 9:58 AM, Owen O'Malley wrote:
>>
>>> The link inversion and ranking algorithms for Yahoo Search are now
>>> being generated on Hadoop:
>>>
>>> http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html
>>>
>>> Some Webmap size data:
>>>
>>>     * Number of links between pages in the index: roughly 1 trillion links
>>>     * Size of output: over 300 TB, compressed!
>>>     * Number of cores used to run a single Map-Reduce job: over 10,000
>>>     * Raw disk used in the production cluster: over 5 Petabytes
>>>
>>
>
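Ted's correction in the thread above reduces to a single division: total links divided by assumed links per page. A minimal sketch of that back-of-the-envelope check (the function name is invented here for illustration, not anything from the Webmap code):

```python
def implied_pages(total_links: int, links_per_page: int) -> int:
    """Rough page-count estimate: total link count / average links per page."""
    return total_links // links_per_page

ONE_TRILLION = 10**12  # 1 trillion = 10^12 = a million million

# At 10 links per page, one trillion links implies 100 x 10^9 pages,
# not 1 x 10^9 as first suggested in the thread.
print(implied_pages(ONE_TRILLION, 10))   # 100000000000 (100 billion)

# Even at 100 links per page, the implied count is still 10B pages.
print(implied_pages(ONE_TRILLION, 100))  # 10000000000 (10 billion)
```

The 10k-million figure in the earlier message undercounts a trillion by a factor of 100, which is exactly the gap Ted points out.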

