Message-ID: <4EA92628.9070904@sendmail.cz>
Date: Thu, 27 Oct 2011 11:36:40 +0200
From: Radim Kolar
To: mapreduce-user@hadoop.apache.org
Subject: Re: unsort algorithmus in map/reduce
References: <4EA69361.4040401@sendmail.cz> <4EA6D72A.10906@sendmail.cz>

> If on the other hand, you want to guarantee that you don't swamp the servers on each domain and you are trying to throttle
> your fetchers, then you want to do something like re-write the urls to be backwards:
>
> com.test.www/http/page1.html
> com.test.www/http/page2.html
> com.test.www/http/page3.html
> com.test2.www/http/page1.html
> com.test2.www/http/page2.html

I didn't get why they have to be backwards. If we are interested in the distance in the URL queue between URLs from the same origin server, then the distance is the same.

Or did you want to reverse them like this?

page1.html/com.test.www/http
page1.html/com.test2.www/http

Then I am not sure this ordering is better than pure random or MD5.

> and use a total ordering of the sort. (You'll need to sample the data
> to pick the cut points.) That will limit each site to one or
> occasionally two mappers and thus the maximum number of concurrent
> fetchers will be the number of threads in each mapper.

I need to spread each site across as many mappers as possible, because there is a crawl delay between requests per site.
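
To make the two orderings easier to compare, here is a minimal sketch in plain Python (not Hadoop code, and not taken from anyone's crawler). The function names `reversed_host_key`, `md5_key`, `partition_by_sorted_ranges` and `partitions_per_host` are hypothetical, and cutting the sorted list into equal contiguous ranges is only a stand-in for a total-order sort with sampled cut points. It counts how many partitions ("mappers") each site ends up in under each key.

```python
import hashlib
from urllib.parse import urlsplit


def reversed_host_key(url):
    """Build a key like 'com.test.www/http/page1.html' (host labels reversed)."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return f"{host}/{parts.scheme}{parts.path}"


def md5_key(url):
    """Hash the full URL so that URLs of one site scatter over the key space."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def partition_by_sorted_ranges(urls, key_fn, num_partitions):
    """Sort by key and cut into contiguous, roughly equal ranges.

    Assumption: this imitates total-order partitioning; a real job would
    pick cut points by sampling the data instead of sorting everything.
    """
    ordered = sorted(urls, key=key_fn)
    size = max(1, -(-len(ordered) // num_partitions))  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def partitions_per_host(partitions):
    """Count how many partitions contain at least one URL of each host."""
    counts = {}
    for part in partitions:
        for host in {urlsplit(u).hostname for u in part}:
            counts[host] = counts.get(host, 0) + 1
    return counts


# Made-up sample data: two sites with eight pages each.
urls = [f"http://www.test.com/page{i}.html" for i in range(1, 9)] + \
       [f"http://www.test2.com/page{i}.html" for i in range(1, 9)]

by_host = partitions_per_host(partition_by_sorted_ranges(urls, reversed_host_key, 4))
by_md5 = partitions_per_host(partition_by_sorted_ranges(urls, md5_key, 4))
print("reversed-host key:", by_host)
print("md5 key:          ", by_md5)
```

With the reversed-host key, each range holds URLs of a single site, so a site is confined to a few partitions. With the MD5 key the placement is effectively random, which is closer to what I want when the per-site crawl delay is the bottleneck.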