Mailing-List: contact common-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: common-dev@hadoop.apache.org
Received-SPF: pass (nike.apache.org: domain of Mingxi.Wu@turn.com designates
 216.75.227.229 as permitted sender)
From: Mingxi Wu <Mingxi.Wu@turn.com>
To: "common-dev@hadoop.apache.org" <common-dev@hadoop.apache.org>
Subject: Hadoop - non disk based sorting?
Thread-Topic: Hadoop - non disk based sorting?
Thread-Index: AQHMrvDhz/mC1UMQi0akmJ2CzRzIfg==
Date: Tue, 29 Nov 2011 23:44:49 +0000
Message-ID: <8473D1F51DC7684E8D6514A8FAE5C7ACF0FC55@turn-mail02.turn.corp>
References: <32876785.post@talk.nabble.com>
	<CAFh8yc91_ucx+-V0gYBS5tSVtU3Zx6+y42U5Mt807E08A9Tr3A@mail.gmail.com>
 <CADYHM8yZVMn4rP297mW=4sk_g9S7mALKr7sOch=7Vh9Fo3+kLg@mail.gmail.com>
In-Reply-To: 
 <CADYHM8yZVMn4rP297mW=4sk_g9S7mALKr7sOch=7Vh9Fo3+kLg@mail.gmail.com>
Accept-Language: en-US
Content-Language: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Hi,

I have a question regarding the reducer's shuffle phase.

It appears that when the map output is large (in my case, 5 billion records), the reducer fails with an out-of-memory error like the one below.

Error: java.lang.OutOfMemoryError: Java heap space
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1592)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1452)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)

However, I thought the shuffle phase uses a disk-based sort, which is not constrained by memory.
So why would a job run into this OutOfMemoryError? After I increased the number of reducers from 100 to 200, the problem went away.
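
For a rough sense of scale, the sketch below works out how many records each reducer would receive before and after the change. It assumes the 5 billion records are spread evenly across reducers, which may not hold if the keys are skewed, and the class and method names are made up for illustration. It only shows that doubling the reducer count halves each reducer's share; it does not confirm that this is why the error went away.

```java
// Hypothetical helper, not part of Hadoop: estimates the per-reducer
// record count under the assumption of a perfectly even key distribution.
class ReducerLoad {

    // Ceiling division, so a remainder still counts against one reducer.
    static long recordsPerReducer(long totalRecords, int numReducers) {
        if (numReducers <= 0) {
            throw new IllegalArgumentException("numReducers must be positive");
        }
        return (totalRecords + numReducers - 1) / numReducers;
    }

    public static void main(String[] args) {
        long totalRecords = 5_000_000_000L; // map output size from the job above

        // Original setting: 100 reducers.
        System.out.println("100 reducers: " + recordsPerReducer(totalRecords, 100));

        // Setting after which the error no longer appeared: 200 reducers.
        System.out.println("200 reducers: " + recordsPerReducer(totalRecords, 200));
    }
}
```

Under the even-distribution assumption, this gives 50,000,000 records per reducer with 100 reducers and 25,000,000 with 200.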

Any input regarding this memory issue would be appreciated!

Thanks,

Mingxi