hadoop-mapreduce-issues mailing list archives

From "luoli (JIRA)" <j...@apache.org>
Subject [jira] Updated: (MAPREDUCE-1690) Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
Date Sun, 11 Apr 2010 07:22:41 GMT

     [ https://issues.apache.org/jira/browse/MAPREDUCE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

luoli updated MAPREDUCE-1690:
-----------------------------

        Fix Version/s: 0.20.3
                       0.20.2
    Affects Version/s: 0.20.2
                       0.20.3
          Description: 
       When a reduce task launches, it starts several MapOutputCopier threads to download
the output of finished maps; each thread is a running MapOutputCopier instance. Every time
a thread copies map output from a remote host to the local one, the MapOutputCopier decides
whether to shuffle the map output data in memory or to disk. This depends on the size of the
map output and on the configuration of the ShuffleRamManager, which is loaded from the client
hadoop-site.xml or the JobConf. Either way, if the reduce task decides to shuffle the map
output data in memory, the MapOutputCopier connects to the remote map host, reads the map
output from the socket, and copies it into an in-memory buffer. Every time, that in-memory
buffer comes from "byte[] shuffleData = new byte[mapOutputLength];", and here is where the
problem begins. In our cluster, there are some special jobs that process a huge amount of
input data, say 110TB, so the reduce tasks shuffle a lot of data, some to disk and some in
memory. Even so, a lot of data is shuffled in memory, and every time the MapOutputCopier
threads "new" more memory from the reduce heap. For a long-running job over huge data, this
easily fills the Reduce Task's heap, drives the reduce task to OOM, and then exhausts the
memory of the TaskTracker machine.
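       To make the failure mode concrete, here is a minimal sketch of the current per-fetch
allocation pattern, not the real 0.20 code; CopierSketch, ShuffleRamManagerStub, and fetch()
are illustrative stand-ins for the actual ReduceTask internals:

    public class CopierSketch {
        static class ShuffleRamManagerStub {
            // e.g. a per-output limit derived from hadoop-site.xml / JobConf
            final long maxSingleShuffleLimit = 64L << 20;
            boolean fitsInMemory(long mapOutputLength) {
                return mapOutputLength < maxSingleShuffleLimit;
            }
        }

        byte[] fetch(ShuffleRamManagerStub ramManager, int mapOutputLength) {
            if (ramManager.fitsInMemory(mapOutputLength)) {
                // the problematic pattern: a fresh buffer per fetch, so many
                // concurrent copiers together can grow the heap until OOM
                return new byte[mapOutputLength];
            }
            return null; // otherwise the output is shuffled to disk instead
        }
    }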
       Here is our solution: change the code logic for how the MapOutputCopier threads shuffle
map output in memory, using a BuddySystem similar to the Linux kernel's buddy system, which
is used to allocate and deallocate memory pages. When the reduce task launches, initialize
this BuddySystem with some fixed amount of memory, say 128MB. Every time the reduce wants
to shuffle map output in memory, it requests a buffer from the buddySystem; if the buddySystem
has enough free memory it hands a block out, and if not, the MapOutputCopier threads wait(),
just as they do in the current hadoop shuffle code (a sketch follows below). This bounds the
Reduce Task's memory usage and greatly relieves the TaskTracker memory shortage. In our
cluster, this buddySystem made the situation of "losing a batch of tasktrackers because of
memory overuse when huge jobs run" disappear, and therefore makes the cluster more stable.
          Component/s: task
                       tasktracker

   I will upload the patch code and the related data later.
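   In the meantime, here is a minimal sketch of the idea, assuming one pre-allocated pool
per reduce task; ShuffleBuddyAllocator and its method names are hypothetical and will not
necessarily match the patch:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class ShuffleBuddyAllocator {
        private final int minOrder;   // smallest block = 1 << minOrder bytes
        private final int maxOrder;   // whole pool     = 1 << maxOrder bytes
        public final byte[] pool;     // e.g. maxOrder = 27 for a 128MB pool
        private final List<TreeSet<Integer>> freeLists; // free offsets per order

        public ShuffleBuddyAllocator(int minOrder, int maxOrder) {
            this.minOrder = minOrder;
            this.maxOrder = maxOrder;
            this.pool = new byte[1 << maxOrder];
            this.freeLists = new ArrayList<TreeSet<Integer>>();
            for (int i = 0; i <= maxOrder; i++) freeLists.add(new TreeSet<Integer>());
            freeLists.get(maxOrder).add(0); // one free block spanning the pool
        }

        private int orderFor(int size) {
            int order = minOrder;
            while ((1 << order) < size) order++;
            return order;
        }

        // Returns an offset into pool; blocks, like today's copiers, when full.
        public synchronized int allocate(int size) throws InterruptedException {
            int order = orderFor(size);
            if (order > maxOrder)
                throw new IllegalArgumentException("shuffle this output to disk");
            while (true) {
                // find the smallest free block of order >= the requested order
                for (int o = order; o <= maxOrder; o++) {
                    if (!freeLists.get(o).isEmpty()) {
                        int offset = freeLists.get(o).pollFirst();
                        // split down to the requested order, freeing each buddy
                        for (int cur = o; cur > order; cur--) {
                            freeLists.get(cur - 1).add(offset + (1 << (cur - 1)));
                        }
                        return offset;
                    }
                }
                wait(); // pool exhausted: wait(), as the copiers do today
            }
        }

        public synchronized void release(int offset, int size) {
            int order = orderFor(size);
            // merge with the buddy block repeatedly while the buddy is free too
            while (order < maxOrder) {
                int buddy = offset ^ (1 << order);
                if (!freeLists.get(order).remove(Integer.valueOf(buddy))) break;
                offset = Math.min(offset, buddy);
                order++;
            }
            freeLists.get(order).add(offset);
            notifyAll(); // wake any copiers blocked in allocate()
        }
    }

   A copier thread would call allocate(mapOutputLength) before reading from the socket,
read the map output into pool at the returned offset, and call release() once that
in-memory segment has been merged or spilled; release() merges buddies back into larger
blocks and wakes waiting copiers, which is what keeps the total shuffle memory bounded
at the pool size instead of growing with the number of in-flight copies.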

> Using BuddySystem to reduce the ReduceTask's mem usage in the step of shuffle
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1690
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1690
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.20.3
>            Reporter: luoli
>             Fix For: 0.20.2, 0.20.3
>

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

       
