hadoop-mapreduce-issues mailing list archives

From "luoli (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1690) Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
Date Tue, 27 Apr 2010 07:33:35 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12861284#action_12861284 ]

luoli commented on MAPREDUCE-1690:
----------------------------------

Thank you, Arun. I will upload a patch and put this into trunk soon.

bq. What is the correlation between the shuffle and 'lost tasktracker'?
There have been times on our cluster when the reduce tasks shuffled too much map output
into memory in a short moment, filling the reduce tasks' heaps to their maximum. If -Xmx
is set too big, the current in-memory shuffle logic is in use, and several reduce tasks are
running on the same tasktracker at once, that creates the situation of "tasktracker gets
lost because the memory of the slave machine is overused".
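
To put hypothetical numbers on that (these are illustrative, not measurements from our
cluster): a 16 GB slave running 8 reduce tasks at -Xmx1536m can commit up to 12 GB to
reduce heaps alone, before the TaskTracker's own heap and the OS are counted. If several
of those reduces fill their in-memory shuffle buffers at the same moment, the machine
starts swapping, heartbeats time out, and the JobTracker declares the TaskTracker lost.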

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2
>
>         Attachments: allo_use_buddy.JPG, allo_use_buddy_gc.JPG, allo_use_new.JPG, allo_use_new_gc.JPG,
>                      mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v1.patch, mapreduce-1690.v2.patch
>
>
>        When a reduce task launches, it starts several MapOutputCopier threads to
> download the outputs of finished maps; each thread is a running MapOutputCopier
> instance. Every time a thread copies a map output from a remote host to the local
> one, it decides whether to shuffle the data into memory or onto disk. That decision
> depends on the size of the map output and on the configuration of the
> ShuffleRamManager, which is loaded from the client's hadoop-site.xml or the JobConf.
> Whenever the reduce task decides to shuffle a map output in memory, the
> MapOutputCopier connects to the remote map host, reads the map output from the
> socket, and copies it into an in-memory buffer, and every time that buffer comes
> from "byte[] shuffleData = new byte[mapOutputLength];". Here is where the problem
> begins. On our cluster there are some special jobs that process a huge amount of
> input data, say 110 TB, so their reduce tasks shuffle a lot of data, some of it to
> disk and some of it in memory. Even so, a lot of data is shuffled in memory, and
> each time the MapOutputCopier threads "new" that memory from the reduce heap. For a
> long-running job over huge data, this easily fills the reduce task's heap, drives
> the reduce task to OOM, and then exhausts the memory of the TaskTracker machine.
>        Here is our solution: change the logic by which the MapOutputCopier threads
> shuffle map output in memory, using a BuddySystem similar to the Linux kernel's
> buddy system for allocating and freeing memory pages. When the reduce task launches,
> give this BuddySystem some initial memory, say 128 MB. Every time the reduce wants
> to shuffle a map output in memory, it asks the buddySystem for a buffer; if the
> buddySystem has enough memory, the buffer is used, and if not, the MapOutputCopier
> threads wait(), just as they do in the current Hadoop shuffle code. This reduces the
> reduce task's memory usage and greatly eases the TaskTracker memory shortage. On our
> cluster, this buddySystem made the situation of "losing a batch of tasktrackers
> because of memory overuse while huge jobs run" disappear, and therefore made the
> cluster more stable.
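
For concreteness, here is a minimal sketch of what such a buddy-system shuffle buffer
could look like. It is not the attached mapreduce-1690 patch; the class name
BuddyShuffleRam, the method names, MIN_ORDER, and the pool size are all illustrative
assumptions. It shows only the core mechanics the description relies on: one backing
array carved into power-of-two blocks, copier threads that wait() when nothing fits,
and buddy coalescing on release.

{code}
// Minimal sketch (not the attached patch): a buddy-system pool for
// in-memory shuffle buffers. All names and sizes are illustrative.
import java.util.ArrayList;
import java.util.List;

public class BuddyShuffleRam {
  private static final int MIN_ORDER = 16;      // smallest block: 64 KB
  private final int maxOrder;                   // pool size = 2^maxOrder bytes
  private final byte[] pool;                    // one backing buffer, e.g. 128 MB
  private final List<List<Integer>> free;       // free block offsets, per order

  public BuddyShuffleRam(int order) {           // e.g. new BuddyShuffleRam(27)
    maxOrder = order;
    pool = new byte[1 << maxOrder];
    free = new ArrayList<List<Integer>>();
    for (int i = 0; i <= maxOrder; i++) {
      free.add(new ArrayList<Integer>());
    }
    free.get(maxOrder).add(0);                  // initially one pool-sized block
  }

  /** Blocks (wait()) until a large-enough block is free, just as the
   *  current shuffle code waits when ShuffleRamManager is full. */
  public synchronized int allocate(int bytes) throws InterruptedException {
    if (bytes > (1 << maxOrder)) {
      throw new IllegalArgumentException("larger than pool: " + bytes);
    }
    int order = orderFor(bytes);
    int offset;
    while ((offset = tryAllocate(order)) < 0) {
      wait();                                   // a release() will wake us
    }
    return offset;                              // index into pool[]
  }

  /** Returns a block and merges it with its buddy while possible. */
  public synchronized void release(int offset, int bytes) {
    int order = orderFor(bytes);
    while (order < maxOrder) {
      Integer buddy = Integer.valueOf(offset ^ (1 << order));
      if (!free.get(order).remove(buddy)) {
        break;                                  // buddy still in use: stop merging
      }
      offset = Math.min(offset, buddy.intValue());
      order++;
    }
    free.get(order).add(Integer.valueOf(offset));
    notifyAll();                                // wake waiting copiers
  }

  private int tryAllocate(int order) {
    int o = order;
    while (o <= maxOrder && free.get(o).isEmpty()) {
      o++;                                      // find the smallest free block
    }
    if (o > maxOrder) {
      return -1;                                // nothing big enough right now
    }
    int offset = free.get(o).remove(free.get(o).size() - 1).intValue();
    while (o > order) {                         // split down, keep the lower half
      o--;
      free.get(o).add(Integer.valueOf(offset + (1 << o)));
    }
    return offset;
  }

  private int orderFor(int bytes) {
    int order = MIN_ORDER;
    while ((1 << order) < bytes) {
      order++;                                  // round up to a power of two
    }
    return order;
  }

  public byte[] backing() {
    return pool;                                // copiers read sockets into this
  }
}
{code}

Under a scheme like this, a MapOutputCopier would call allocate(mapOutputLength)
instead of "new byte[mapOutputLength]", read the socket into backing() at the returned
offset, and call release() once the segment has been merged or spilled, so total
in-memory shuffle usage can never exceed the one pool that was sized up front.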

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

