hadoop-mapreduce-user mailing list archives

From Radim Kolar <...@sendmail.cz>
Subject Re: unsort algorithmus in map/reduce
Date Tue, 25 Oct 2011 15:35:06 GMT
On 25.10.2011 14:21, Niels Basjes wrote:
> Why not do something very simple: use the MD5 of the URL as the key 
> you sort by.
> This scales very easily and gives a highly randomized order.
> Maybe not the optimal maximum distance, but certainly a very good 
> distribution and very easy to build.
I tried it, and the problem is that sites with a lot of URLs block the 
queue. You can have a few sites with 5M URLs each; they take up a major 
portion of the queue, and small sites are not crawled.
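
To illustrate the effect, here is a minimal Python sketch, not the actual crawler or MapReduce job. The host names, URL counts, and the helper names md5_key, order_by_md5 and host_share are all made up for the example. It sorts URLs by the MD5 of the URL, as suggested above, and then counts which hosts occupy the front of the queue.

```python
import hashlib
from collections import Counter
from urllib.parse import urlparse


def md5_key(url: str) -> str:
    """Sort key from the suggestion above: hex MD5 digest of the URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def order_by_md5(urls):
    """Return the URLs in pseudo-random but deterministic MD5 order."""
    return sorted(urls, key=md5_key)


def host_share(urls, prefix_len):
    """Count how many of the first prefix_len URLs belong to each host."""
    return Counter(urlparse(u).netloc for u in urls[:prefix_len])


# Hypothetical data: one large site and 50 small single-page sites.
big = [f"http://big.example/page{i}" for i in range(5000)]
small = [f"http://small{i}.example/" for i in range(50)]

queue = order_by_md5(big + small)
share = host_share(queue, 100)

# The order is uniform over URLs, not over hosts, so the large site
# fills almost all of the first 100 slots of the queue.
print(share.most_common(3))
```

MD5 ordering spreads URLs evenly, so each host gets queue slots in proportion to its URL count. With these made-up numbers the large site owns about 99% of the URLs and therefore about 99% of any prefix of the queue.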
