hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Binglin Chang (Created) (JIRA)" <j...@apache.org>
Subject [jira] [Created] (MAPREDUCE-3247) Add hash aggregation style data flow and/or new API
Date Sat, 22 Oct 2011 12:44:34 GMT
Add hash aggregation style data flow and/or new API

                 Key: MAPREDUCE-3247
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3247
             Project: Hadoop Map/Reduce
          Issue Type: New Feature
          Components: task
    Affects Versions: 0.23.0
            Reporter: Binglin Chang

In many join/aggregation like queries run on top of mapreduce, sort is not need, in fact a
hash table based join/aggregation is more efficient, this is described in "Tenzing A SQL Implementation
On The MapReduce Framework" in detail. There are two ways to support hash table based join/aggregation
in hadoop mapreduce:

# Only support no sort, the framework do nothing, just pass partitioned k/v pair from mapper
to reducer
   The upper application use hash table in their mapper & reducer to do aggregation, and
emit all hashtable enties in cleanup() of mapper/reducer, this is how Google did in Tenzing.
The main problem is memory control of hashtable.

# Add new "fold" API, it can coexist with combiner/reducer API, user can use mapper-combiner-reducer
or "mapper-folder" (maybe a bad name, welcome to propose a better name..)
   Like foldl in functional programming: folder should have the semantic:
     foldl folder z (x:xs)  =   foldl folder (folder z x) xs
   In this way, upper applications only need to provide folder, underlying framework create
and maintains hashtable for key/value pairs, it can be managed & optimized by the framework.
For example, in mapper side, we can pre emit entire hashtable or use some policies like cache
algorithm to emit part of k/v pairs to free some memory, if the memory consumption reach io.sort.mb

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message