hadoop-hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Joydeep Sen Sarma (JIRA)" <j...@apache.org>
Subject [jira] Created: (HIVE-1467) dynamic partitioning should cluster by partitions
Date Thu, 15 Jul 2010 20:14:50 GMT
dynamic partitioning should cluster by partitions
-------------------------------------------------

                 Key: HIVE-1467
                 URL: https://issues.apache.org/jira/browse/HIVE-1467
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Joydeep Sen Sarma


(based on internal discussion with Ning). Dynamic partitioning should offer a mode where it
clusters data by partition before writing out to each partition. This will reduce number of
files. Details:

1. always use reducer stage
2. mapper sends to reducer based on partitioning column. ie. reducer = f(partition-cols)
3. f() can be made somewhat smart to:

   a. spread large partitions across multiple reducers - each mapper can maintain row count
seen per partition - and then apply (whenever it sees a new row for a partition): 
       * reducer = (row count / 64k) % numReducers 
       Small partitions always go to one reducer. the larger the partition, the more the reducers.
this prevents one reducer becoming bottleneck writing out one partition

   b. this still leaves the issue of very large number of splits. (64K rows from 10K mappers
is pretty large). for this one can apply one slight modification:
       * reducer = (mapper-id/1024 + row-count/64k) % numReducers

   ie. - the first 1000 mappers always send the first 64K rows for one partition to the same
reducer. the next 1000 send it to the next one. and so on.

the constants 1024 and 64k are used just as an example. i don't know what the right numbers
are. it's also clear that this is a case where we need hadoop to do only partitioning (and
no sorting). this will be a useful feature to have in hadoop. that will reduce the overhead
due to reducers.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message