From: Harsh J
Date: Sat, 14 Jul 2012 11:38:14 +0530
Subject: Re: Basic question on how reducer works
To: mapreduce-user@hadoop.apache.org
Cc: Grandl Robert

If you wish to impose a limit on the maximum input a reducer is
allowed to receive in a job, you may set "mapreduce.reduce.input.limit"
on your job, as the total number of bytes allowed per reducer. But this
is more of a hard limit (the job fails if a reducer's input exceeds
it), which I suspect your question wasn't about. Your question is
indeed better suited to Pig's user list.
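In case it helps, setting it in a driver would look roughly like the
sketch below; this assumes the 1.x-era API, and the class name, job
name and the 10 GB figure are only illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ReduceInputLimitDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cap the total map-output bytes any single reducer may consume.
        // The value is in bytes; the default of -1 disables the check.
        conf.setLong("mapreduce.reduce.input.limit",
                     10L * 1024 * 1024 * 1024); // 10 GB per reducer
        Job job = new Job(conf, "reduce-input-limit-demo");
        // ... set mapper, reducer, input/output paths as usual, then:
        // job.waitForCompletion(true);
      }
    }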
On Tue, Jul 10, 2012 at 8:59 PM, Subir S wrote:
> Is there any property to convey the maximum amount of data each
> reducer/partition may take for processing, like the bytes_per_reducer
> of Pig, so that the number of reducers can be controlled based on the
> size of the intermediate map output?
>
> On 7/10/12, Karthik Kambatla wrote:
>> The partitioner is configurable. The default partitioner, from what I
>> remember, computes the partition as the key's hash code modulo the
>> number of reducers/partitions. For random input it is balanced, but
>> some cases can have a very skewed key distribution. Also, as you have
>> pointed out, the number of values per key can vary. Together, both of
>> these determine the "weight" of each partition, as you call it.
>>
>> Karthik
>>
>> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:
>>
>>> Thanks Arun.
>>>
>>> So, just for my clarification: the map will create partitions
>>> according to the number of reducers, such that each reducer gets
>>> almost the same number of keys in its partition. However, each key
>>> can have a different number of values, so the "weight" of each
>>> partition will depend on that. Also, when a new value is added into
>>> a partition, a hash on the partition ID will be computed to find the
>>> corresponding partition?
>>>
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 4:33 PM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>>
>>> Thanks a lot, guys, for the answers.
>>>
>>> Still, I am not able to find exactly the code for the following:
>>>
>>> 1. Where a reducer reads only its own partition from a map's output.
>>> I looked into ReduceTask#getMapOutput, which does the actual read in
>>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>>> partition (reduce ID) to read.
>>>
>>> Look at TaskTracker.MapOutputServlet.
>>>
>>> 2. I still don't understand very well in which part of the code
>>> (MapTask.java) the intermediate data is written to which partition.
>>> MapOutputBuffer is the one that actually writes the data to the
>>> buffer and spills after the buffer is full. Could you please
>>> elaborate a bit on how the data is written to its partition?
>>>
>>> Essentially, you can think of the partition-id as the 'primary key'
>>> and the actual 'key' in the map output as the 'secondary key'.
>>>
>>> hth,
>>> Arun
>>>
>>> Thanks,
>>> Robert
>>>
>>> ------------------------------
>>> *From:* Arun C Murthy
>>> *To:* mapreduce-user@hadoop.apache.org
>>> *Sent:* Monday, July 9, 2012 9:24 AM
>>> *Subject:* Re: Basic question on how reducer works
>>>
>>> Robert,
>>>
>>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>>
>>> Hi,
>>>
>>> I have some questions related to basic functionality in Hadoop.
>>>
>>> 1. When a mapper processes the intermediate output data, how does it
>>> know how many partitions to create (i.e., how many reducers there
>>> will be) and how much data goes into each partition for each
>>> reducer?
>>>
>>> 2. When the JobTracker assigns a task to a reducer, it also
>>> specifies the locations of the intermediate output data to retrieve,
>>> right? But how does a reducer know, for each remote location holding
>>> intermediate output, which portion is the one it alone has to
>>> retrieve?
>>>
>>> To add to Harsh's comment: essentially, the TT *knows* where the
>>> output of a given map-id/reduce-id pair is present via an
>>> output-file/index-file combination.
>>>
>>> Arun
>>>
>>> --
>>> Arun C. Murthy
>>> Hortonworks Inc.
>>> http://hortonworks.com/
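As an aside, since the default partitioner came up above: from what I
remember, its logic is essentially the snippet below (a sketch of
Hadoop's HashPartitioner written from memory; check the sources of
your release for the authoritative version):

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask the sign bit so a negative hashCode() can't yield a
        // negative index, then fold into [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

This guarantees all values of a key land in the same partition, but
nothing balances the number of values per key across partitions, hence
the skew Karthik described.

--
Harsh J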