hadoop-mapreduce-issues mailing list archives

From "MengWang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-279) Map-Reduce 2.0
Date Fri, 04 Mar 2011 11:09:36 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13002543#comment-13002543 ]

MengWang commented on MAPREDUCE-279:


How does shuffle work in MapReduce 2.0?

Our study shows that shuffle is a performance bottleneck in MapReduce computing. Here are
some problems and open questions about shuffle:
(1) Shuffle and reduce are tightly coupled. The shuffle phase usually doesn't consume much
memory or CPU, so in theory a reduce task's slot could be used for other computing tasks
while it is copying data from the maps. This would improve cluster utilization. Furthermore,
should shuffle be separated from reduce? Then shuffle would not use a reduce slot, and we
wouldn't need to distinguish between map slots and reduce slots at all.
(2) For large jobs, shuffle uses too many network connections, and each connection transfers
very little data, which is inefficient. Since 0.21.0 one connection can transfer several map
outputs, but I think this is not enough. Maybe we can use a per-node shuffle client process
(like the tasktracker) to shuffle data for all reduce tasks on that node; then we can
shuffle more data through one connection.
(3) Too many concurrent connections cause the shuffle server to do massive random IO, which
is inefficient. Maybe we can aggregate HTTP requests (like the delay scheduler), so that the
random IO becomes sequential.
(4) How can the memory used by shuffle be managed efficiently? We use buddy memory allocation,
which wastes a considerable amount of memory.
(5) If shuffle is separated from reduce, how do we achieve reduce locality?
(6) Can we store map outputs in a storage system (like HDFS)?
(7) Can shuffle be a general data transfer service, not only for the map/reduce paradigm?
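
To make point (3) concrete, here is a rough sketch of what aggregating requests could look
like. It is not how any Hadoop shuffle server is implemented. The request tuple
(map_id, offset, length, reducer_id) and the function name are made up for illustration. The
idea is only that pending requests are held for a short time, grouped by map output file, and
ordered by byte offset, so that each file is read in one forward pass.

```python
from collections import defaultdict


def plan_sequential_reads(requests):
    """Order pending fetch requests so each map output file is read front to back.

    `requests` is an iterable of hypothetical (map_id, offset, length, reducer_id)
    tuples collected during a short batching window.
    """
    by_file = defaultdict(list)
    for map_id, offset, length, reducer_id in requests:
        by_file[map_id].append((offset, length, reducer_id))

    plan = []
    for map_id in sorted(by_file):
        # Sorting by offset turns scattered seeks into one forward scan per file.
        for offset, length, reducer_id in sorted(by_file[map_id]):
            plan.append((map_id, offset, length, reducer_id))
    return plan


pending = [
    ("m2", 0, 10, "r0"),
    ("m1", 100, 50, "r1"),
    ("m1", 0, 100, "r0"),
    ("m2", 10, 5, "r1"),
]
for read in plan_sequential_reads(pending):
    print(read)
```

The cost is extra latency for the requests that wait in the batching window, which is the
same kind of trade-off the delay scheduler makes.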
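
On point (4), the waste comes from the way a buddy allocator rounds every request up to a
power-of-two block. The small calculation below illustrates this. The 4096-byte minimum block
and the segment sizes are assumptions chosen for the example, not measurements from our
cluster.

```python
def buddy_block_size(request_bytes, min_block=4096):
    """Smallest power-of-two block (at least min_block) that fits the request."""
    size = min_block
    while size < request_bytes:
        size *= 2
    return size


def wasted_fraction(request_sizes, min_block=4096):
    """Fraction of allocated memory lost to rounding (internal fragmentation)."""
    allocated = sum(buddy_block_size(r, min_block) for r in request_sizes)
    requested = sum(request_sizes)
    return (allocated - requested) / allocated


# Hypothetical map output segment sizes in bytes.
segments = [5000, 70000, 130000]
print(buddy_block_size(5000))
print(wasted_fraction(segments))
```

A segment just over a power of two is the worst case: it occupies a block almost twice its
size, so nearly half of that block is unusable by other segments.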

> Map-Reduce 2.0
> --------------
>                 Key: MAPREDUCE-279
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-279
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker, tasktracker
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.23.0
> Re-factor MapReduce into a generic resource scheduler and a per-job, user-defined component
that manages the application execution. 

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

