Subject: Re: Mapper Record Spillage
From: Hans Uhlig <huhlig@uhlisys.com>
To: mapreduce-user@hadoop.apache.org
Date: Sat, 10 Mar 2012 20:08:30 -0800

I am attempting to specify this for a single job during its creation/submission, not via the general cluster configuration. I am using the new API, so I am adding the values to the Configuration passed into new Job().

2012/3/10 WangRamon <ramon_wang@hotmail.com>:
> How many map/reduce task slots do you have on each node? If the total
> number is 10, then you will use 10 * 4096 MB of memory when all tasks are
> running, which is more than the 32 GB of total memory you have on each
> node.
>
> ------------------------------
> Date: Sat, 10 Mar 2012 20:00:13 -0800
> Subject: Mapper Record Spillage
> From: huhlig@uhlisys.com
> To: mapreduce-user@hadoop.apache.org
>
> I am attempting to speed up a mapping process whose input is GZIP-compressed
> CSV files. The files range from 1-2 GB, and I am running on a cluster where
> each node has a total of 32 GB of memory available. I have attempted to
> tweak mapred.map.child.jvm.opts with -Xmx4096mb and io.sort.mb to 2048 to
> accommodate the size, but I keep getting Java heap errors or other
> memory-related problems. My row count per mapper is well below the
> Integer.MAX_VALUE limit by several orders of magnitude, and the box is NOT
> using anywhere close to its full memory allotment.
> How can I specify that this map task can have 3-4 GB of memory for the
> collection, partition, and sort process without constantly spilling records
> to disk?
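For reference, a minimal sketch of what the per-job setup described above could look like with the new (org.apache.hadoop.mapreduce) API. Property names are from Hadoop 0.20/1.x and vary by version (the map-specific mapred.map.child.java.opts only exists in later releases; older ones use mapred.child.java.opts); the job name and mapper wiring are illustrative placeholders, not from the thread. Note two likely culprits in the reported settings: the JVM heap suffix must be "m", not "mb" (the JVM rejects "-Xmx4096mb"), and io.sort.mb is backed by a single int-indexed byte array, so 2048 is over its limit.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuningJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Heap for each map task JVM. The suffix must be "m" -- a value like
        // "-Xmx4096mb" is invalid and the task JVM fails to honor it.
        conf.set("mapred.map.child.java.opts", "-Xmx4096m");

        // In-memory sort buffer in MB. It is allocated as one byte[], so it
        // must stay below 2048 and must fit comfortably inside the task heap.
        conf.setInt("io.sort.mb", 1024);

        // Optionally spill later: start spilling at 90% full instead of 80%.
        conf.setFloat("io.sort.spill.percent", 0.90f);

        // "csv-import" is a placeholder job name for this sketch.
        Job job = new Job(conf, "csv-import");
        // ... set mapper class, input/output formats, and paths here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because these are plain Configuration keys, the same values can also be passed per job on the command line (e.g. via -D with GenericOptionsParser) without touching the cluster-wide mapred-site.xml.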
