hadoop-mapreduce-user mailing list archives

From Niels Basjes <Ni...@basjes.nl>
Subject Re: unsort algorithmus in map/reduce
Date Tue, 25 Oct 2011 12:21:34 GMT
Why not do something very simple: use the MD5 of the URL as the key you
sort by.
This scales very easily and gives a highly randomized order.
It may not give the optimal maximum distance, but it certainly gives a very
good distribution and is very easy to build.
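
A minimal sketch of this idea in plain Python, outside of Hadoop, may help to
see the effect. The helper names md5_key and unsort are made up for this
illustration. In a real job the map step would presumably emit
(md5_key(url), url) and the framework's sort on the key would do the
reordering.

```python
import hashlib


def md5_key(url: str) -> str:
    """Hex MD5 digest of the URL, used only as a pseudo-random sort key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def unsort(urls):
    """Return the URLs ordered by their MD5 digest.

    This stands in for the shuffle/sort phase of a MapReduce job whose
    mapper emits (md5_key(url), url).
    """
    return sorted(urls, key=md5_key)


if __name__ == "__main__":
    urls = [
        "http://www.test.com/page1.html",
        "http://www.test.com/page2.html",
        "http://www.test.com/page3.html",
        "http://www.test2.com/page1.html",
        "http://www.test2.com/page2.html",
        "http://www.test2.com/page3.html",
    ]
    for url in unsort(urls):
        # Show the first few hex digits of the key next to each URL.
        print(md5_key(url)[:8], url)
```

The order is deterministic for a given set of URLs, but nothing prevents two
URLs from the same domain from ending up next to each other; that is the
trade-off mentioned above.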

Niels Basjes

2011/10/25 Radim Kolar <hsn@sendmail.cz>

> Hi, I am having a problem implementing unsort for a crawler in map/reduce.
>
> I have a list of URLs waiting to be fetched; they need to be reordered for
> maximum distance between URLs from one domain.
>
> The idea is to do
>  map URL -> domain, URL
>
>  test.com, http://www.test.com/page1.html
>  test.com, http://www.test.com/page2.html
>  test.com, http://www.test.com/page3.html
>  test2.com, http://www.test2.com/page1.html
>  test2.com, http://www.test2.com/page2.html
>  test2.com, http://www.test2.com/page3.html
>
>  reduce test.com, <list> -> priority, URL
>
> 10, http://www.test.com/page1.html
>  9, http://www.test.com/page2.html
>  8, http://www.test.com/page3.html
> 10, http://www.test2.com/page1.html
>  9, http://www.test2.com/page2.html
>  8, http://www.test2.com/page3.html
>
>
> Now I need to order the output by key
>
> 10, http://www.test.com/page1.html
> 10, http://www.test2.com/page1.html
>  9, http://www.test.com/page2.html
>  9, http://www.test2.com/page2.html
>  8, http://www.test.com/page3.html
>  8, http://www.test2.com/page3.html
>
> and write the list of URLs in this order to output files, e.g. 50k URLs to
> file1, the next 50k to file2, and so on.
>
> Can you give me an idea how to sort using mapred and how to process the
> sorted data and split it into files?
>
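
For comparison, the approach described in the quoted message could be
sketched in plain Python roughly like this. It is only an illustration under
some assumptions: the names domain_of, interleave and chunks are hypothetical,
the domain is taken as the hostname with a leading "www." dropped, and an
ascending per-domain rank (0, 1, 2, ...) is used instead of the descending
priorities 10, 9, 8 from the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Simplified domain: the hostname without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def interleave(urls):
    """Group by domain, rank URLs within each domain, then sort by rank."""
    by_domain = defaultdict(list)
    for url in urls:  # "map": URL -> (domain, URL)
        by_domain[domain_of(url)].append(url)

    ranked = []
    for domain, group in by_domain.items():  # "reduce": one call per domain
        for rank, url in enumerate(group):
            ranked.append((rank, domain, url))

    ranked.sort()  # second pass: order by rank, then by domain
    return [url for _, _, url in ranked]


def chunks(items, size):
    """Yield consecutive slices of at most `size` items (one per output file)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    urls = [
        "http://www.test.com/page1.html",
        "http://www.test.com/page2.html",
        "http://www.test.com/page3.html",
        "http://www.test2.com/page1.html",
        "http://www.test2.com/page2.html",
        "http://www.test2.com/page3.html",
    ]
    # Use a chunk size of 4 here instead of 50k to keep the output short.
    for number, part in enumerate(chunks(interleave(urls), 4), start=1):
        print("file%d:" % number, part)
```

On a cluster this needs two passes (rank per domain, then a sort on the rank)
and the fixed-size split only works if the second pass produces one totally
ordered output, which is part of why the MD5 key is the simpler option.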



-- 
Best regards / Met vriendelijke groeten,

Niels Basjes
