hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Binglin Chang (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3397) Support no sort dataflow in map output and reduce merge phrase
Date Tue, 15 Nov 2011 02:56:52 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150208#comment-13150208
] 

Binglin Chang commented on MAPREDUCE-3397:
------------------------------------------

I think why no sort make sense is that, in many cases application has a more efficient way
to process data(such as do aggregation on the fly), they don't want the framework to do some
sort of heavy weighted data preprocessing, cause they have better prior knowledge/understanding
about the data and the goal.

                
> Support no sort dataflow in map output and reduce merge phrase
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-3397
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3397
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>          Components: task
>    Affects Versions: 0.20.205.0
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: MAPREDUCE-3397-nosort.v1.patch
>
>
> In our experience, many data aggregation style queries/jobs don't need to sort the intermediate
data. In fact reducer side can use hashmap or even array to do application level aggregations.
For example, consider computing CTR using display log & click log in sponsored search.
Map side just emit (adv_id, clk_cnt, dis_cnt), reduce side aggregate clk_cnt and dis_cnt for
every adv_id, cause adv_id is integer, we can partition adv_id by range:
> ** reduce0: 0-100000
> ** reduce1: 100000-200000
> ** ...
> ** reduceM: xxx-max adv-id
> Then the reducer can use an array(for example: int [1000000][2]) to store the aggregated
clk_cnt & dis_cnt, and we don't need the framework to sort intermediate data anymore.
> By supporting no sort, we can gain a lot of performance improvements:
> # Eliminate map side sort & merge. 
>   KV paris need to sort by partition first, but this can be done using a liner time counting
sort, which is much faster than quick sort.
>   Just merge spill segments one by one, doesn't need to use heap merge.
> # Eliminate shuffle phrase barrier, reducer can start to processing data before all map
output data are copied & merged.
> For most cases, memory won't be a problem, cause keys are divided to many partitions,
each reducers only process a small subset of the global key set. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message