hadoop-mapreduce-user mailing list archives

From Radim Kolar <...@sendmail.cz>
Subject Re: unsort algorithmus in map/reduce
Date Thu, 27 Oct 2011 09:36:40 GMT
> If on the other hand, you want to guarantee that you don't swamp the
> servers on each domain and you are trying to throttle your fetchers,
> then you want to do something like re-write the urls to be backwards:
>
> com.test.www/http/page1.html
> com.test.www/http/page2.html
> com.test.www/http/page3.html
> com.test2.www/http/page1.html
> com.test2.www/http/page2.html
I didn't get why they have to be backwards: if we are interested in the
distance in the URL queue between URLs from the same origin server, then
that distance is the same either way.

Or did you want to reverse them like this:

page1.html/com.test.www/http
page1.html/com.test2.www/http

Then I am not sure this ordering is better than pure random or md5.
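
To make the comparison concrete, here is a minimal Python sketch of the two
key styles being discussed. It is only an illustration: the helper names
host_reversed_key and md5_key are made up for this example, and the URLs are
the sample ones quoted above. Sorting on the host-reversed key places all
pages of one site next to each other, while sorting on the md5 key orders
URLs by digest, with no regard to host.

```python
import hashlib
from urllib.parse import urlsplit


def host_reversed_key(url):
    """Build a key like 'com.test.www/http/page1.html' from a URL."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels: www.test.com -> com.test.www
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return f"{reversed_host}/{parts.scheme}{parts.path}"


def md5_key(url):
    """Hex MD5 digest of the whole URL, used here as a pseudo-random key."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


# Sample URLs (assumed forms of the keys quoted in the thread).
urls = [
    "http://www.test.com/page1.html",
    "http://www.test2.com/page1.html",
    "http://www.test.com/page2.html",
    "http://www.test2.com/page2.html",
    "http://www.test.com/page3.html",
]

by_host = sorted(urls, key=host_reversed_key)  # same-site URLs are adjacent
by_md5 = sorted(urls, key=md5_key)             # order follows the digest only

for url in by_host:
    print(host_reversed_key(url))
```

With the host-reversed key, the distance between two URLs of the same site
in the sorted queue is small by construction, which is the opposite of what
a fetcher with a per-site crawl delay wants.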

> and use a total ordering of the sort. (You'll need to sample the data 
> to pick the cut points.) That will limit each site to one or 
> occasionally two mappers and thus the maximum number of concurrent 
> fetchers will be the number of threads in each mapper.
I need to spread each site across as many mappers as possible, because
there is a crawl delay between requests to each site.
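
As a rough sketch of what I mean by spreading, the Python below assigns each
URL to a bucket by hashing the full URL, then reports how many buckets end
up holding pages of a given site. Everything here is an assumption made for
illustration: bucket_for and spread_by_host are hypothetical helpers, a
"bucket" stands in for a mapper, and the 200 generated URLs are made-up
sample data, not a real crawl list.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def bucket_for(url, num_buckets):
    """Map a URL to a bucket index using the MD5 digest of the full URL."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets


def spread_by_host(urls, num_buckets):
    """Return {host: set of bucket indices that received that host's URLs}."""
    spread = defaultdict(set)
    for url in urls:
        spread[urlsplit(url).hostname].add(bucket_for(url, num_buckets))
    return dict(spread)


# Hypothetical sample: 200 pages of one site, 8 buckets (mappers).
urls = [f"http://www.test.com/page{i}.html" for i in range(200)]
spread = spread_by_host(urls, 8)
print(len(spread["www.test.com"]), "of 8 buckets hold URLs for www.test.com")
```

Because the hash covers the whole URL and not just the host, pages of one
site are not pinned to a single bucket, unlike a total ordering on the
host-reversed key.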
