From: "Parimi, Nagender"
To: "mapreduce-user@hadoop.apache.org"
Date: Mon, 2 Aug 2010 11:22:26 -0700
Subject: How does the framework sort data?

Hi,

 

This is an admittedly naïve question, but I’ve been unable to find a comprehensive answer online. I have gone through the tutorial a few times (http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html), and my question is simple: who or what performs the sort in MapReduce? The tutorial above states the following in a few places -

 

“The framework sorts the outputs of the maps, which are then input to the reduce tasks”

 

“The Mapper outputs are sorted and then partitioned per Reducer”

 

But this glosses over an important detail - who’s sorting the mappers’ outputs and how? Sorting huge amounts of data isn’t cheap, hence my interest.

 

The tutorial mentions that the Reducer performs 3-4 steps - Shuffle, Sort, a possible Secondary Sort on values, and lastly Reduce. It then states -

 

“The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged”

 

I’ve been told that mappers sort their outputs and that a merge sort is then used to combine them. So is it some randomly chosen reducers that perform the merges? I would love to know more about the details; does anyone know? If you know of a doc that explains it, I’d appreciate it if you could pass it along!
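
To make the question concrete, here is a minimal Python sketch of the scheme as I understand it from what I've been told. It is not Hadoop's code: the function names (`partition`, `map_side_sort`, `reduce_side_merge`), the byte-sum partitioner, and the sample records are all hypothetical stand-ins I made up for illustration. The assumption is that each mapper splits its output per reducer and sorts each split by key, and that a merge step then combines the already-sorted runs.

```python
import heapq
from collections import defaultdict


def partition(key, num_reducers):
    # Hypothetical partitioner: a deterministic stand-in for a hash partitioner.
    return sum(key.encode("utf-8")) % num_reducers


def map_side_sort(records, num_reducers):
    """Split one mapper's (key, value) output per reducer, then sort each split by key."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition(key, num_reducers)].append((key, value))
    return {r: sorted(pairs, key=lambda kv: kv[0]) for r, pairs in buckets.items()}


def reduce_side_merge(sorted_runs):
    """Merge already-sorted runs (one per mapper) into one sorted list."""
    return list(heapq.merge(*sorted_runs, key=lambda kv: kv[0]))


# Made-up sample data: the unsorted output of two mappers.
mapper_outputs = [
    [("pear", 1), ("apple", 1), ("fig", 1)],
    [("fig", 1), ("apple", 1), ("kiwi", 1)],
]
num_reducers = 2

# Each mapper sorts its own output, one sorted run per reducer.
runs_per_mapper = [map_side_sort(out, num_reducers) for out in mapper_outputs]

# Each reducer merges only the runs addressed to its own partition.
merged = {
    r: reduce_side_merge([runs.get(r, []) for runs in runs_per_mapper])
    for r in range(num_reducers)
}
```

In this picture nothing is randomly chosen: the partitioner decides which reducer owns a key, and that reducer merges the sorted runs for its own partition. Whether this matches what the framework really does is exactly what I'm asking.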

 

thanks,

Nagender
