Date: Mon, 24 Sep 2012 21:07:07 +1100 (NCT)
From: "Namit Jain (JIRA)"
To: hive-dev@hadoop.apache.org
Reply-To: dev@hive.apache.org
Message-ID: <922086545.116099.1348481227945.JavaMail.jiratomcat@arcas>
In-Reply-To: <2056877466.116016.1348476967857.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (HIVE-3502) design efficient bucketing techniques

    [ https://issues.apache.org/jira/browse/HIVE-3502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13461708#comment-13461708 ]

Namit Jain commented on HIVE-3502:
----------------------------------

A very useful follow-up optimization for this would be:

For any Hive query that requires more than 1 MR job, the second MR job mostly has an identity mapper, and most of the work is done in the reducer.
If the output of the first MR job can be bucketized according to the requirements of the 2nd MR job, the 2nd MR job does not need a reducer at all.

> design efficient bucketing techniques
> -------------------------------------
>
>                 Key: HIVE-3502
>                 URL: https://issues.apache.org/jira/browse/HIVE-3502
>             Project: Hive
>          Issue Type: New Feature
>          Components: Query Processor
>            Reporter: Namit Jain
>
> Currently, the bucketing techniques are fairly expensive: the bucketing keys
> have to be the same as the reduction keys, and the process of bucketization requires
> a full-blown map-reduce job.
> It should be possible to perform map-side bucketization. The high-level idea is
> to shard the data based on the number of buckets and create a sub-directory for each
> bucket. Then the data from all the mappers (in the same sub-directory) can be merged.
> So, instead of having 1 file per bucket, it would lead to 1 directory per bucket.
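
To make the map-side idea concrete, here is a minimal local sketch in Python. It is not Hive code and not the design proposed in the issue. The names bucket_for, map_side_bucketize, read_bucket, sum_per_key and run_demo are hypothetical, the bucket_NNNNN/mapper_NNNNN file layout is an assumption, and crc32 only stands in for whatever hash function Hive actually uses. Each simulated mapper shards its rows by hash(key) % num_buckets and writes one file into the sub-directory of each bucket it touches, so no shuffle is needed. Because every row for a given key lands in the same bucket, a later stage can aggregate each bucket independently, which is the reducer-free second job described in the comment.

```python
import os
import tempfile
import zlib
from collections import defaultdict


def bucket_for(key, num_buckets):
    """Deterministic bucket id. crc32 is a stand-in, not Hive's hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def map_side_bucketize(mapper_id, rows, num_buckets, out_dir):
    """One simulated mapper: write each (key, value) row into its bucket sub-directory."""
    handles = {}
    try:
        for key, value in rows:
            b = bucket_for(key, num_buckets)
            if b not in handles:
                # Assumed layout: one sub-directory per bucket, one file per mapper.
                bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
                os.makedirs(bucket_dir, exist_ok=True)
                path = os.path.join(bucket_dir, f"mapper_{mapper_id:05d}.txt")
                handles[b] = open(path, "w", encoding="utf-8")
            handles[b].write(f"{key}\t{value}\n")
    finally:
        for fh in handles.values():
            fh.close()


def read_bucket(out_dir, b):
    """Merge step: concatenate every mapper's file found in one bucket sub-directory."""
    bucket_dir = os.path.join(out_dir, f"bucket_{b:05d}")
    if not os.path.isdir(bucket_dir):
        return []
    rows = []
    for name in sorted(os.listdir(bucket_dir)):
        with open(os.path.join(bucket_dir, name), encoding="utf-8") as fh:
            for line in fh:
                key, value = line.rstrip("\n").split("\t")
                rows.append((key, value))
    return rows


def sum_per_key(rows):
    """Per-bucket aggregation; needs no shuffle because a key lives in one bucket only."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += int(value)
    return dict(totals)


def run_demo(num_buckets=4):
    # Two input splits, each handled by its own simulated mapper.
    splits = [
        [("a", "1"), ("b", "2"), ("c", "3")],
        [("a", "4"), ("d", "5"), ("b", "6")],
    ]
    with tempfile.TemporaryDirectory() as out_dir:
        for mapper_id, rows in enumerate(splits):
            map_side_bucketize(mapper_id, rows, num_buckets, out_dir)
        buckets = {b: read_bucket(out_dir, b) for b in range(num_buckets)}
    totals = {}
    for rows in buckets.values():
        totals.update(sum_per_key(rows))  # key sets are disjoint across buckets
    return buckets, totals
```

Calling run_demo() returns the rows grouped by bucket together with the per-key sums computed bucket by bucket. In a real cluster the merge step would operate on files in a distributed file system rather than a local temporary directory.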