Subject: How does map-merge work exactly?
From: Martin Dobmeier
To: user@hadoop.apache.org
Date: Thu, 13 Sep 2012 16:04:36 +0200

Hi all,

I'm greatly confused about the spill/sort/merge thing going on during the Map phase.

Here are some stats:
- io.sort.mb = 256 MB (80% spill threshold)
- io.sort.factor = 64
- spills performed during Map: 117
- number of reducers: 96

Now I'm having real trouble understanding the following log output.

...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Down to the last merge-pass, with 0 segments left of total size: 0 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 56
mapred.Merger: Down to the last merge-pass, with 3 segments left of total size: 67119046 bytes
...
mapred.Merger: Merging 117 sorted segments
mapred.Merger: Merging 54 intermediate segments out of a total of 117
mapred.Merger: Down to the last merge-pass, with 64 segments left of total size: 1609011189 bytes
...

What exactly is a segment? Does the segment count correspond to the number of spills?
What does "0 segments left" mean? Does it mean that the merge could be performed on the first pass?
Why are only 54 segments merged instead of "io.sort.factor" segments? (io.sort.factor determines the number of files to merge during a pass, right?)
Why is the merge performed "number of reducers" times? (I count the line "Merging 117 sorted segments" exactly 96 times.)
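
One possible reading of the 54, offered as a sketch rather than a confirmed answer: assume the merger shortens only the first pass, so that every later pass can merge exactly io.sort.factor segments. Each pass that merges n segments reduces the total by n - 1, so with 117 segments and a factor of 64 the first pass would take (117 - 1) mod (64 - 1) + 1 = 54 segments and leave 64 for the last pass, which matches the third log excerpt. The function names below are hypothetical and the sizing rule is my assumption about the behaviour, not something taken from the logs.

```python
def first_pass_factor(factor, pass_no, num_segments):
    """Segments to merge in this pass (hypothetical helper, assumed rule)."""
    # Later passes, or inputs that already fit in one pass, use the full factor.
    if pass_no > 1 or num_segments <= factor or factor == 1:
        return factor
    # A pass over `factor` segments removes factor - 1 of them, so the first
    # pass absorbs the remainder and later passes come out even.
    mod = (num_segments - 1) % (factor - 1)
    return factor if mod == 0 else mod + 1


def simulate(num_segments, factor):
    """Return how many segments each merge pass would consume."""
    merged = []
    pass_no = 1
    remaining = num_segments
    while remaining > factor:
        n = first_pass_factor(factor, pass_no, remaining)
        merged.append(n)
        remaining = remaining - n + 1  # n inputs become one merged output
        pass_no += 1
    merged.append(remaining)  # the last pass merges whatever is left
    return merged


print(simulate(117, 64))  # [54, 64]
```

Under the same assumption, a partition with fewer non-empty segments than the factor would need only the last pass, which might also be what the "0 segments left" case is showing.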

Thanks a lot!
Martin
