Mailing-List: contact mapreduce-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <3374982.15791271044722626.JavaMail.jira@thor>
Date: Sun, 11 Apr 2010 23:58:42 -0400 (EDT)
From: "luoli (JIRA)" <jira@apache.org>
To: mapreduce-issues@hadoop.apache.org
Subject: [jira] Updated: (MAPREDUCE-1690) Using BuddySystem to reduce the
 ReduceTask's mem usage in the step of shuffle
In-Reply-To: <7564278.7471270966482123.JavaMail.jira@thor>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


     [ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoli updated MAPREDUCE-1690:
-----------------------------

    Attachment: mapreduce-1690.v1.patch

This is the patch file, a diff against branch-0.20. It contains only the BuddySystem code and its unit test. I haven't modified ReduceTask.java yet, because it is hard to merge the svn branch code with the code we are running ourselves right now. I will merge the BuddySystem into the Hadoop code and upload patch v2 later.

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2, 0.20.3
>
>         Attachments: mapreduce-1690.v1.patch
>
>
>        When a reduce task launches, it starts several MapOutputCopier threads to download the output of finished map tasks. Each time a MapOutputCopier thread copies a map output, whether remote or local, it decides whether to shuffle the data in memory or to disk. The decision depends on the size of the map output and on the ShuffleRamManager configuration, which is loaded from the client's hadoop-site.xml or the JobConf. If the reduce task decides to shuffle the map output in memory, the MapOutputCopier connects to the remote map host, reads the map output from the socket, and copies it into an in-memory buffer. Every time, that buffer is allocated with "byte[] shuffleData = new byte[mapOutputLength];", and this is where the problem begins.
>        In our cluster there are some special jobs that process a huge amount of input data, say 110TB, so the reduce tasks shuffle a lot of data. Some of it is shuffled to disk and some in memory, but even so a lot of data is shuffled in memory, and every time the MapOutputCopier threads "new" some memory from the reduce task's heap. For a long-running job with huge data, this easily fills the reduce task's heap, makes the reduce task run out of memory (OOM), and then exhausts the memory of the TaskTracker machine.
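
To make the allocation pattern concrete, here is a simplified model of it. This is not Hadoop's ReduceTask code: the class PerCopyAllocationModel, its method names, and the single size threshold standing in for the ShuffleRamManager limits are all hypothetical. Only the "new byte[mapOutputLength]" line comes from the description.

```java
/** Hypothetical, simplified model of the per-copy allocation; not Hadoop's ReduceTask code. */
final class PerCopyAllocationModel {
    /** Stand-in for the ShuffleRamManager check: small outputs go to memory, the rest to disk. */
    static boolean shuffleInMemory(long mapOutputLength, long maxSingleShuffleBytes) {
        return mapOutputLength < maxSingleShuffleBytes;
    }

    /** The allocation quoted in the description: a fresh array for every in-memory map output. */
    static byte[] newShuffleBuffer(int mapOutputLength) {
        byte[] shuffleData = new byte[mapOutputLength];
        return shuffleData;
    }

    /** Total bytes requested from the heap for a sequence of map outputs. */
    static long bytesRequestedFromHeap(int[] mapOutputLengths, long maxSingleShuffleBytes) {
        long total = 0;
        for (int length : mapOutputLengths) {
            if (shuffleInMemory(length, maxSingleShuffleBytes)) {
                total += newShuffleBuffer(length).length; // each call asks the heap for a new array
            } // else: in this model the output would be written to local disk instead
        }
        return total;
    }

    public static void main(String[] args) {
        int[] lengths = {4 << 20, 8 << 20, 64 << 20, 2 << 20}; // made-up map output sizes in bytes
        long limit = 32L << 20;                                 // made-up 32 MB threshold
        System.out.println(bytesRequestedFromHeap(lengths, limit) + " bytes requested");
    }
}
```

The total requested grows with the number of in-memory map outputs, which is the pressure on the reduce heap that the description complains about.
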
>        Here is our solution: change the logic that MapOutputCopier threads use to shuffle map output in memory so that it uses a BuddySystem, similar to the buddy system the Linux kernel uses to allocate and deallocate memory pages. When the reduce task launches, it initializes the BuddySystem with some memory, say 128MB. Every time the reduce task wants to shuffle a map output in memory, it requests a buffer from the BuddySystem. If the BuddySystem has enough memory, the buffer is used; if not, the MapOutputCopier threads wait(), just as they do in the current Hadoop shuffle code. This reduces the reduce task's memory usage and greatly relieves the memory shortage on the TaskTracker. In our cluster, with the BuddySystem we no longer lose a batch of TaskTrackers to memory overuse while the huge jobs are running, which makes the cluster more stable.
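
A minimal sketch of the buddy idea follows. It is not the code in mapreduce-1690.v1.patch: the class BuddyPool, its method names, and the block sizes are hypothetical. It hands out offsets into one byte[] allocated up front, splits blocks in halves on allocation, merges freed buddies, and makes callers wait() when no block is free. A request larger than the pool is simply rejected here; in the shuffle such an output would presumably be sent to disk.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical buddy allocator over one byte[] that is allocated once, up front. */
final class BuddyPool {
    private final byte[] pool;     // backing memory shared by all buffers
    private final int minBlock;    // smallest block size in bytes (power of two)
    private final int maxOrder;    // order of the block that spans the whole pool
    private final List<Deque<Integer>> free = new ArrayList<>(); // free offsets per order

    BuddyPool(int poolSize, int minBlock) {
        if (poolSize <= 0 || minBlock <= 0 || minBlock > poolSize
                || Integer.bitCount(poolSize) != 1 || Integer.bitCount(minBlock) != 1) {
            throw new IllegalArgumentException("sizes must be positive powers of two");
        }
        this.pool = new byte[poolSize];
        this.minBlock = minBlock;
        this.maxOrder = Integer.numberOfTrailingZeros(poolSize / minBlock);
        for (int i = 0; i <= maxOrder; i++) free.add(new ArrayDeque<>());
        free.get(maxOrder).push(0); // initially one free block covering the pool
    }

    /** Smallest order whose block size (minBlock << order) can hold size bytes. */
    private int orderFor(int size) {
        if (size <= 0 || size > pool.length) throw new IllegalArgumentException("bad size " + size);
        int order = 0;
        while ((minBlock << order) < size) order++;
        return order;
    }

    /** Non-blocking: returns the block's offset in the pool, or -1 if nothing fits right now. */
    synchronized int tryAllocate(int size) {
        int order = orderFor(size);
        int o = order;
        while (o <= maxOrder && free.get(o).isEmpty()) o++; // find a large enough free block
        if (o > maxOrder) return -1;
        int offset = free.get(o).pop();
        while (o > order) { // split: keep the lower half, put the upper half on the free list
            o--;
            free.get(o).push(offset + (minBlock << o));
        }
        return offset;
    }

    /** Blocking variant: waits until release() frees a block that fits. */
    synchronized int allocate(int size) throws InterruptedException {
        int offset;
        while ((offset = tryAllocate(size)) < 0) wait();
        return offset;
    }

    /** Returns a block and merges it with its buddy for as long as the buddy is also free. */
    synchronized void release(int offset, int size) {
        int order = orderFor(size);
        while (order < maxOrder) {
            int buddy = offset ^ (minBlock << order); // buddy differs only in the block-size bit
            if (!free.get(order).remove(Integer.valueOf(buddy))) break;
            offset = Math.min(offset, buddy);
            order++;
        }
        free.get(order).push(offset);
        notifyAll(); // wake copier threads blocked in allocate()
    }

    /** Bytes currently on the free lists. */
    synchronized int freeBytes() {
        int total = 0;
        for (int o = 0; o <= maxOrder; o++) total += free.get(o).size() * (minBlock << o);
        return total;
    }

    /** The shared array; a caller reads and writes only inside its own block. */
    byte[] backingArray() { return pool; }

    public static void main(String[] args) throws InterruptedException {
        BuddyPool buddy = new BuddyPool(1 << 20, 1 << 10); // 1 MB pool, 1 KB minimum block
        int first = buddy.allocate(3000);   // rounded up to a 4 KB block
        int second = buddy.allocate(1000);  // rounded up to a 1 KB block
        System.out.println("offsets: " + first + ", " + second);
        buddy.release(second, 1000);
        buddy.release(first, 3000);
        System.out.println("free bytes: " + buddy.freeBytes());
    }
}
```

Because the pool is sized once, the memory held for in-memory shuffle stays bounded no matter how many map outputs are copied. The price is internal fragmentation, since every request is rounded up to a power-of-two block.
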

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira