From: Subir S <subir.sasikumar@gmail.com>
To: mapreduce-user@hadoop.apache.org
Cc: Grandl Robert
Date: Tue, 10 Jul 2012 20:59:39 +0530
Subject: Re: Basic question on how reducer works

Is there any property to convey the maximum amount of data each
reducer/partition may take for processing, like Pig's bytes_per_reducer,
so that the number of reducers can be controlled based on the size of
the intermediate map output?

On 7/10/12, Karthik Kambatla wrote:
> The partitioner is configurable. The default partitioner, from what I
> remember, computes the partition as the key's hashcode modulo the
> number of reducers/partitions. For random input that is balanced, but
> some cases can have a very skewed key distribution. Also, as you have
> pointed out, the number of values per key can also vary. Together, the
> two determine the "weight" of each partition, as you call it.
>
> Karthik
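For concreteness, the default rule Karthik describes boils down to a
one-liner. The sketch below is modeled on Hadoop's HashPartitioner (from
memory, so check the source of your version); the sign-bit mask keeps a
negative hashCode() from producing a negative partition number:

    import org.apache.hadoop.mapreduce.Partitioner;

    public class HashPartitioner<K, V> extends Partitioner<K, V> {
      @Override
      public int getPartition(K key, V value, int numReduceTasks) {
        // Mask off the sign bit, then take the hash modulo the number
        // of reducers, so every key lands in [0, numReduceTasks).
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

If this skews badly for your keys, the same getPartition signature is
the hook for a custom partitioner, plugged in via
job.setPartitionerClass(...).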
> On Mon, Jul 9, 2012 at 8:15 PM, Grandl Robert wrote:
>
>> Thanks Arun.
>>
>> So, just for my clarification: the map will create partitions
>> according to the number of reducers, such that each reducer gets
>> almost the same number of keys in its partition. However, each key can
>> have a different number of values, so the "weight" of each partition
>> will depend on that. Also, when a new <key, value> pair is added, is a
>> hash computed on the key to find the corresponding partition?
>>
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 4:33 PM
>> *Subject:* Re: Basic question on how reducer works
>>
>> On Jul 9, 2012, at 12:55 PM, Grandl Robert wrote:
>>
>> Thanks a lot guys for the answers.
>>
>> Still, I am not able to find exactly the code for the following things:
>>
>> 1. Where a reducer reads only its own partition from a map output. I
>> looked into ReduceTask#getMapOutput, which does the actual read in
>> ReduceTask#shuffleInMemory, but I don't see where it specifies which
>> partition (reduce ID) to read.
>>
>> Look at TaskTracker.MapOutputServlet.
>>
>> 2. I still don't understand very well in which part of the code
>> (MapTask.java) the intermediate data is written to which partition. So
>> MapOutputBuffer is the one that actually writes the data to the buffer
>> and spills after the buffer is full. Could you please elaborate a bit
>> on how the data is written to which partition?
>>
>> Essentially, you can think of the partition-id as the 'primary key'
>> and the actual 'key' in the map output as the 'secondary key'.
>>
>> hth,
>> Arun
>>
>> Thanks,
>> Robert
>>
>> ------------------------------
>> *From:* Arun C Murthy
>> *To:* mapreduce-user@hadoop.apache.org
>> *Sent:* Monday, July 9, 2012 9:24 AM
>> *Subject:* Re: Basic question on how reducer works
>>
>> Robert,
>>
>> On Jul 7, 2012, at 6:37 PM, Grandl Robert wrote:
>>
>> Hi,
>>
>> I have some questions related to basic functionality in Hadoop.
>>
>> 1. When a mapper produces the intermediate output data, how does it
>> know how many partitions to create (i.e., how many reducers there will
>> be) and how much data should go into each partition for each reducer?
>>
>> 2. When the JobTracker assigns a task to a reducer, it will also
>> specify the locations of the intermediate output data it should
>> retrieve, right? But how will a reducer know, from each remote
>> location holding intermediate output, which portion it alone has to
>> retrieve?
>>
>> To add to Harsh's comment: essentially the TT *knows* where the output
>> of a given map-id/reduce-id pair is present via an
>> output-file/index-file combination.
>>
>> Arun
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
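For anyone digging through MapTask.java after this thread: below is a
toy, self-contained sketch (not Hadoop source; the names are made up for
illustration) of the ordering Arun describes, with the partition-id as
the "primary key" and the map-output key as the "secondary key". After
such a sort, each reducer's partition is one contiguous run of records,
which is what lets the index file store a simple (offset, length) per
reducer:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;

    public class SpillOrderSketch {

      // Stand-in for one serialized map-output record.
      static class Record {
        final int partition; // as computed by the Partitioner
        final String key;    // the actual map-output key

        Record(int partition, String key) {
          this.partition = partition;
          this.key = key;
        }

        public String toString() {
          return "partition=" + partition + " key=" + key;
        }
      }

      public static void main(String[] args) {
        List<Record> buffer = new ArrayList<Record>();
        buffer.add(new Record(1, "banana"));
        buffer.add(new Record(0, "cherry"));
        buffer.add(new Record(1, "apple"));
        buffer.add(new Record(0, "apricot"));

        // Primary sort on partition-id, secondary sort on key.
        Collections.sort(buffer, new Comparator<Record>() {
          public int compare(Record a, Record b) {
            if (a.partition != b.partition) {
              return a.partition < b.partition ? -1 : 1;
            }
            return a.key.compareTo(b.key);
          }
        });

        // Prints partition 0's records first (sorted by key), then
        // partition 1's: each reducer's slice is contiguous, so one
        // index entry per partition is enough for the fetch.
        for (Record r : buffer) {
          System.out.println(r);
        }
      }
    }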