hive-dev mailing list archives

From "Namit Jain (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-3502) design efficient bucketing techniques
Date Mon, 24 Sep 2012 10:07:07 GMT

    [ https://issues.apache.org/jira/browse/HIVE-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461708#comment-13461708 ]

Namit Jain commented on HIVE-3502:
----------------------------------

A very useful follow-up optimization for this can be:

For any Hive query that requires more than one MR job, the second MR job
mostly has an identity mapper, and most of the work is done in the reducer.
If the output of the first MR job can be bucketized based on the requirements
of the second MR job, the second MR job does not need a reducer at all.
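
To illustrate why the reducer becomes unnecessary, here is a minimal Python
sketch, not Hive code. It assumes the second job is a sum grouped by key, and
the names bucket_id, first_job and second_job_map_only are hypothetical. CRC32
stands in for whatever hash Hive would actually use. Because every occurrence
of a key lands in the same bucket, each bucket can be aggregated independently
by a map-only task and the per-bucket results simply concatenated.

```python
import zlib
from collections import Counter, defaultdict


def bucket_id(key, num_buckets):
    # Deterministic stand-in hash (assumption); Hive's own hash is not modelled.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def first_job(rows, num_buckets):
    """Stand-in for the first MR job: emit output already bucketed
    on the key that the second job will group by."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return buckets


def second_job_map_only(buckets):
    """Stand-in for the second job: one map task per bucket, no reducer."""
    result = {}
    for rows in buckets.values():
        local = Counter()
        for key, value in rows:
            local[key] += value
        # A key never appears in two buckets, so no cross-bucket merge is needed.
        result.update(local)
    return result
```

If the first job's output were not bucketed on the grouping key, the same key
could appear in several input splits and a shuffle to a reducer would still be
required.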
                
> design efficient bucketing techniques
> -------------------------------------
>
>                 Key: HIVE-3502
>                 URL: https://issues.apache.org/jira/browse/HIVE-3502
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Currently, the bucketing techniques are fairly expensive: the bucketing keys
> have to be the same as the reduction keys, and the process of bucketization
> requires a full-blown map-reduce job.
> It should be possible to perform a map-side bucketization. The high-level idea is
> to shard the data based on the number of buckets, and create a sub-directory for each
> bucket. Then, the data from all the mappers (in the same sub-directory) can be merged.
> So, instead of having 1 file per bucket, it would lead to 1 directory per bucket.
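
The map-side scheme described above can be sketched in Python as follows. This
is only an illustration on a local filesystem, not Hive code: the directory and
file naming (bucket_NNNNN, part-NNNNN), the CRC32 hash, and the function names
are all assumptions made for the example. Each simulated mapper writes its rows
straight into one sub-directory per bucket, and a bucket is read back by
concatenating every mapper's file in that sub-directory.

```python
import os
import zlib


def bucket_id(key, num_buckets):
    # Deterministic stand-in hash (assumption); Hive's own hash is not modelled.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def map_side_bucketize(mapper_id, rows, out_dir, num_buckets):
    """Simulate one mapper: write each (key, value) row into
    out_dir/bucket_<b>/part-<mapper_id>, with no reduce phase."""
    handles = {}
    try:
        for key, value in rows:
            b = bucket_id(key, num_buckets)
            if b not in handles:
                bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
                os.makedirs(bucket_dir, exist_ok=True)
                path = os.path.join(bucket_dir, f"part-{mapper_id:05d}")
                handles[b] = open(path, "w", encoding="utf-8")
            handles[b].write(f"{key}\t{value}\n")
    finally:
        for fh in handles.values():
            fh.close()


def read_bucket(out_dir, b):
    """Read one bucket by concatenating all mapper files in its sub-directory."""
    bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
    if not os.path.isdir(bucket_dir):
        return []
    rows = []
    for name in sorted(os.listdir(bucket_dir)):
        with open(os.path.join(bucket_dir, name), encoding="utf-8") as fh:
            for line in fh:
                key, value = line.rstrip("\n").split("\t", 1)
                rows.append((key, value))
    return rows
```

The trade-off is visible in the layout: a bucket holds up to one file per
mapper, so a later merge step or a reader that understands bucket directories
is needed.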

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
