hadoop-mapreduce-issues mailing list archives

From "luoli (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1690) Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
Date Mon, 12 Apr 2010 03:58:42 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoli updated MAPREDUCE-1690:
-----------------------------

    Attachment: mapreduce-1690.v1.patch

This patch is a diff against branch-0.20. It contains only the BuddySystem code and its unit
test. I haven't modified ReduceTask.java yet, because it is hard to merge the svn branch code
with the code we are currently running. I will integrate the BuddySystem into the Hadoop code
and upload patch v2 later.

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2, 0.20.3
>
>         Attachments: mapreduce-1690.v1.patch
>
>
>        When a reduce task launches, it starts several MapOutputCopier threads to download
the output of finished map tasks. Each time a MapOutputCopier thread copies map output from
a remote or local host, it decides whether to shuffle that data in memory or to disk. The
decision depends on the map output size and on the ShuffleRamManager configuration, which is
loaded from the client's hadoop-site.xml or from the JobConf. If the copier decides to shuffle
in memory, it connects to the remote map host, reads the map output from the socket, and
copies it into an in-memory buffer. That buffer is allocated anew every time with
"byte[] shuffleData = new byte[mapOutputLength];", and this is where the problem begins.
>        In our cluster, some special jobs process a huge amount of input data, say 110TB,
so their reduce tasks shuffle a lot of data. Some of it is shuffled to disk, but a lot is
still shuffled in memory, and every time the MapOutputCopier threads "new" more memory from
the reduce heap. For a long-running job over huge data, this easily fills the reduce task's
heap, drives the reduce task to OOM, and then exhausts the memory of the TaskTracker
machine.
>        Here is our solution: change the logic that MapOutputCopier threads use to shuffle
map output in memory so that it uses a BuddySystem, similar to the Linux kernel's buddy
system for allocating and deallocating memory pages. When the reduce task launches, it
initializes the BuddySystem with a fixed amount of memory, say 128MB. Every time the reduce
task wants to shuffle map output in memory, it requests a buffer from the BuddySystem. If
the BuddySystem has enough free memory, the buffer is used; if not, the MapOutputCopier
thread wait()s, just as it does in the current Hadoop shuffle code. This lowers the reduce
task's memory usage and greatly relieves memory shortage on the TaskTracker. In our cluster,
with the BuddySystem in place we no longer lose a batch of TaskTrackers to memory overuse
while the huge jobs are running, which makes the cluster more stable.
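
For readers unfamiliar with buddy allocation, the idea described above can be sketched
roughly as follows. This is not the code from mapreduce-1690.v1.patch, which is not quoted
in this message. The class name BuddyPool, its methods, and the order parameters are all
hypothetical, and the sketch assumes a single pre-allocated byte[] whose size is a power of
two, with blocks identified by their offset into that array.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of a buddy allocator over one fixed byte[] pool.
// Not the implementation attached to MAPREDUCE-1690.
final class BuddyPool {
    private final byte[] pool;   // allocated once, shared by all copier threads
    private final int minOrder;  // log2 of the smallest block handed out
    private final int maxOrder;  // log2 of the whole pool
    // freeLists.get(k) holds the offsets of free blocks of size 2^k.
    private final List<TreeSet<Integer>> freeLists = new ArrayList<>();

    BuddyPool(int maxOrder, int minOrder) {
        if (minOrder < 0 || minOrder > maxOrder || maxOrder > 30) {
            throw new IllegalArgumentException("bad orders");
        }
        this.maxOrder = maxOrder;
        this.minOrder = minOrder;
        this.pool = new byte[1 << maxOrder];
        for (int k = 0; k <= maxOrder; k++) {
            freeLists.add(new TreeSet<>());
        }
        freeLists.get(maxOrder).add(0); // initially one free block: the whole pool
    }

    byte[] buffer() {
        return pool;
    }

    // Smallest order whose block size 2^order is >= size, but never below minOrder.
    private int orderFor(int size) {
        if (size <= 0 || size > pool.length) {
            throw new IllegalArgumentException("bad size: " + size);
        }
        int order = 32 - Integer.numberOfLeadingZeros(size - 1);
        return Math.max(order, minOrder);
    }

    // Returns the offset of a block that can hold size bytes, or -1 if none is free.
    synchronized int tryAllocate(int size) {
        int order = orderFor(size);
        int k = order;
        while (k <= maxOrder && freeLists.get(k).isEmpty()) {
            k++;
        }
        if (k > maxOrder) {
            return -1;
        }
        int offset = freeLists.get(k).pollFirst();
        while (k > order) {
            // Split: keep the lower half, put the upper half (the buddy) on the free list.
            k--;
            freeLists.get(k).add(offset + (1 << k));
        }
        return offset;
    }

    // Blocks the calling thread until a suitable block is available.
    synchronized int allocate(int size) throws InterruptedException {
        int offset;
        while ((offset = tryAllocate(size)) < 0) {
            wait();
        }
        return offset;
    }

    // Returns a block; size must be the same value that was passed when allocating it.
    synchronized void free(int offset, int size) {
        int order = orderFor(size);
        while (order < maxOrder) {
            int buddy = offset ^ (1 << order); // buddy differs only in bit 'order'
            if (!freeLists.get(order).remove(buddy)) {
                break; // buddy still in use, cannot merge further
            }
            offset = Math.min(offset, buddy);
            order++;
        }
        freeLists.get(order).add(offset);
        notifyAll(); // wake any copier threads blocked in allocate()
    }

    // Total number of bytes currently on the free lists.
    synchronized long freeBytes() {
        long total = 0;
        for (int k = 0; k <= maxOrder; k++) {
            total += (long) freeLists.get(k).size() << k;
        }
        return total;
    }
}
```

Under these assumptions, new BuddyPool(27, 12) would reserve a 128MB pool with 4KB minimum
blocks. A copier thread would call allocate(mapOutputLength), write the fetched bytes into
buffer() starting at the returned offset, and call free(offset, mapOutputLength) once the
data has been merged. Because requests are rounded up to a power of two, some space inside
each block goes unused; in exchange, the heap consumed by in-memory shuffle stays bounded by
the pool size.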

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
