Date: Mon, 9 Jul 2012 20:33:52 -0700
Subject: Re: Basic question on how reducer works
From: Karthik Kambatla <kasha@cloudera.com>
To: mapreduce-user@hadoop.apache.org, Grandl Robert

The partitioner is configurable. The default partitioner, from what I
remember, computes the partition as the key's hashcode modulo the number of
reducers/partitions. For random input it is balanced, but some cases can
have a very skewed key distribution. Also, as you have pointed out, the
number of values per key can vary. Together, both of them determine the
"weight" of each partition, as you call it.

Karthik

On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:

> Thanks Arun.
>
> So just for my clarification: the map will create partitions according to
> the number of reducers, such that each reducer gets almost the same number
> of keys in its partition. However, each key can have a different number of
> values, so the "weight" of each partition will depend on that. Also, when
> a new <key, value> is added, a hash is computed to find the corresponding
> partition?
>
> Robert
>
> ------------------------------
> *From:* Arun C Murthy
> *To:* mapreduce-user@hadoop.apache.org
> *Sent:* Monday, July 9, 2012 4:33 PM
> *Subject:* Re: Basic question on how reducer works
>
> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>
> Thanks a lot guys for the answers.
>
> Still, I am not able to find exactly the code for the following things:
>
> 1. How a reducer reads only its partition from a map's output. I looked
> into ReduceTask#getMapOutput, which does the actual read in
> ReduceTask#shuffleInMemory, but I don't see where it specifies which
> partition (reduce ID) to read.
>
> Look at TaskTracker.MapOutputServlet.
>
> 2. I still don't understand very well in which part of the code
> (MapTask.java) the intermediate data is written to which partition. So
> MapOutputBuffer is the one that actually writes the data to the buffer and
> spills after the buffer is full. Could you please elaborate a bit on how
> the data is written to which partition?
>
> Essentially, you can think of the partition-id as the 'primary key' and
> the actual 'key' in the map-output <key, value> as the 'secondary key'.
>
> hth,
> Arun
>
> Thanks,
> Robert
>
> ------------------------------
> *From:* Arun C Murthy
> *To:* mapreduce-user@hadoop.apache.org
> *Sent:* Monday, July 9, 2012 9:24 AM
> *Subject:* Re: Basic question on how reducer works
>
> Robert,
>
> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>
> Hi,
>
> I have some questions related to basic functionality in Hadoop.
>
> 1. When a Mapper processes the intermediate output data, how does it know
> how many partitions to create (how many reducers there will be) and how
> much data goes into each partition for each reducer?
>
> 2. When a JobTracker assigns a task to a reducer, it will also specify the
> locations of the intermediate output data it should retrieve, right? But
> how will a reducer know, at each remote location with intermediate output,
> which portion it has to retrieve?
>
> To add to Harsh's comment.
> Essentially, the TT *knows* where the output of a given
> map-id/reduce-id pair is present via an output-file/index-file
> combination.
>
> Arun
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
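The default-partitioner behavior described above (key hashcode modulo the number of reducers) can be sketched roughly as follows. This is a standalone illustration, not Hadoop's actual HashPartitioner class; the class name here is made up for the example:

```java
// Sketch of Hadoop's default partitioning rule: a record's partition is
// its key's hashCode modulo the number of reduce tasks.
public class HashPartitionSketch {
    // Mask off the sign bit so the modulo result is always non-negative,
    // even for keys with a negative hashCode.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands in the same partition,
        // no matter how many values it carries -- which is why a skewed
        // key distribution produces "heavy" partitions.
        System.out.println(getPartition("apple", 4));
        System.out.println(getPartition("apple", 4));
    }
}
```

Note that only keys are hashed; the values attached to a key never influence the partition choice, which is exactly why partition "weight" can vary even when keys are evenly spread.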
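The two points Arun makes -- partition-id as "primary key" in the map-output sort order, and an index file that tells the TT where each partition's slice of the output file starts -- can be sketched together. This is a simplified in-memory model with hypothetical names, not Hadoop's actual spill or IndexRecord format:

```java
import java.util.*;

// Sketch: map-output records are ordered by (partition-id, key), and an
// index maps each partition to where its run begins in the sorted output,
// so a reducer can fetch only its own slice.
public class SpillSketch {
    record Rec(int partition, String key, String value) {}

    // Sort records by partition first ("primary key"), then by key
    // ("secondary key"), and record each partition's start position.
    static Map<Integer, Integer> buildIndex(List<Rec> recs) {
        recs.sort(Comparator.comparingInt(Rec::partition)
                            .thenComparing(Rec::key));
        Map<Integer, Integer> index = new TreeMap<>();
        for (int i = 0; i < recs.size(); i++) {
            index.putIfAbsent(recs.get(i).partition(), i);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Rec> recs = new ArrayList<>(List.of(
                new Rec(1, "b", "v1"), new Rec(0, "a", "v2"),
                new Rec(1, "a", "v3"), new Rec(0, "c", "v4")));
        // After sorting, partition 0's records are contiguous, then
        // partition 1's; the index gives each run's start position.
        System.out.println(buildIndex(recs));
    }
}
```

In real Hadoop the index entries are byte offsets and lengths into the spill file rather than list positions, but the principle is the same: sorting by partition makes each reducer's data contiguous, and the index makes it addressable without scanning.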