Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Mon, 24 Sep 2012 21:07:07 +1100 (NCT)
From: "Namit Jain (JIRA)" <jira@apache.org>
To: hive-dev@hadoop.apache.org
Message-ID: <922086545.116099.1348481227945.JavaMail.jiratomcat@arcas>
In-Reply-To: <2056877466.116016.1348476967857.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (HIVE-3502) design efficient bucketing
 techniques
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HIVE-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461708#comment-13461708 ] 

Namit Jain commented on HIVE-3502:
----------------------------------

A very useful follow-up optimization for this can be:

For any Hive query that requires more than one MR job, the second MR job mostly has an identity mapper,
and most of its work is done in the reducer. If the output of the first MR job can be bucketized according
to the requirements of the second MR job, rows that share a key are already co-located in one bucket,
so the second MR job does not need a reducer at all.
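
A minimal sketch of that idea, in plain Python rather than Hive code. It assumes the second job is a
per-key sum; the names bucket_of, first_job_output and second_job_map_only are hypothetical, and
zlib.crc32 only stands in for whatever hash the real implementation would use.

```python
import zlib
from collections import defaultdict


def bucket_of(key: str, num_buckets: int) -> int:
    # Stand-in hash for illustration; not Hive's actual bucketing function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def first_job_output(rows, num_buckets):
    """Emit the first job's rows already bucketed on the second job's key."""
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[bucket_of(key, num_buckets)].append((key, value))
    return buckets


def second_job_map_only(bucket):
    """Aggregate one bucket locally. No shuffle is needed because every
    row for a given key lives in this bucket."""
    totals = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


def run_second_job(buckets):
    """Process each bucket independently and collect the per-key results."""
    result = {}
    for bucket in buckets:
        # Key sets of different buckets are disjoint, so update() never overwrites.
        result.update(second_job_map_only(bucket))
    return result
```

Each call to second_job_map_only sees only one bucket, which is the sense in which the second job
could run without a reduce phase under these assumptions.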
                
> design efficient bucketing techniques
> -------------------------------------
>
>                 Key: HIVE-3502
>                 URL: https://issues.apache.org/jira/browse/HIVE-3502
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Currently, the bucketing techniques are fairly expensive: the bucketing keys
> have to be the same as the reduction keys, and the process of bucketization requires
> a full-blown map-reduce job.
> It should be possible to perform a map-side bucketization. The high-level idea is
> to shard the data based on the number of buckets, and create a sub-directory for each
> bucket. Then, the data from all the mappers (in the same sub-directory) can be merged.
> So, instead of having 1 file per bucket, it would lead to 1 directory per bucket.
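
The map-side scheme described in the issue could be sketched as follows, using the local filesystem
in place of HDFS. This is an illustration under stated assumptions, not the proposed Hive
implementation: the bucket_NNNNN and mapper_NNNNN names, the tab-separated row format and the
crc32-based bucket_of function are all hypothetical.

```python
import zlib
from pathlib import Path


def bucket_of(key, num_buckets):
    # Stand-in hash for illustration; not Hive's actual bucketing function.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def run_mapper(mapper_id, rows, out_dir, num_buckets):
    """One mapper: write each row to this mapper's own file inside the
    sub-directory of the bucket the row hashes to."""
    handles = {}
    try:
        for key, value in rows:
            b = bucket_of(key, num_buckets)
            if b not in handles:
                bucket_dir = Path(out_dir) / f"bucket_{b:05d}"
                bucket_dir.mkdir(parents=True, exist_ok=True)
                part = bucket_dir / f"mapper_{mapper_id:05d}"
                handles[b] = open(part, "w", encoding="utf-8")
            handles[b].write(f"{key}\t{value}\n")
    finally:
        for handle in handles.values():
            handle.close()


def read_bucket(out_dir, b):
    """A bucket is the union of all mapper files in its sub-directory."""
    bucket_dir = Path(out_dir) / f"bucket_{b:05d}"
    if not bucket_dir.is_dir():
        return []  # no mapper produced a row for this bucket
    rows = []
    for part in sorted(bucket_dir.iterdir()):
        with open(part, encoding="utf-8") as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t")
                rows.append((key, value))
    return rows
```

No reduce step appears anywhere: mappers never exchange rows, and a reader treats the whole
sub-directory as the bucket. Whether the per-mapper files are later merged into one file is left
open here, as it is in the description.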
