Subject: Review Request 21549: Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
From: "Navis Ryu"
To: "Navis Ryu", "hive"
Date: Fri, 16 May 2014 06:08:40 -0000
Message-ID: <20140516060840.716.36001@reviews.apache.org>
Mailing-List: contact dev-help@hive.apache.org; run by ezmlm
Reply-To: dev@hive.apache.org
X-ReviewGroup: hive
X-ReviewRequest-URL: https://reviews.apache.org/r/21549/
X-ReviewRequest-Repository: hive-git
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="===============1511337113004233664=="

--===============1511337113004233664==
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit

-----------------------------------------------------------
This is an automatically generated e-mail.
To reply, visit: https://reviews.apache.org/r/21549/
-----------------------------------------------------------

Review request for hive.


Bugs: HIVE-4867
    https://issues.apache.org/jira/browse/HIVE-4867


Repository: hive-git


Description
-------

A ReduceSinkOperator emits data as keys and values. Right now, a column may appear in both the key list and the value list, which results in unnecessary shuffle overhead.

Example:
Consider the following query:
{code:sql}
explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
{code}

The plan is:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        store_sales
          TableScan
            alias: store_sales
            Select Operator
              expressions:
                    expr: ss_ticket_number
                    type: int
              outputColumnNames: _col0
              Reduce Output Operator
                key expressions:
                      expr: _col0
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: _col0
                      type: int
                tag: -1
                value expressions:
                      expr: _col0
                      type: int
      Reduce Operator Tree:
        Extract
          File Output Operator
            compressed: false
            GlobalTableId: 0
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}

The column 'ss_ticket_number' (output as _col0) is in both the key list and the value list of the ReduceSinkOperator. The type of ss_ticket_number is int. In this case, BinarySortableSerDe adds 1 extra byte for every int in the key, and LazyBinarySerDe adds further overhead when recording the length of an int. As a rough estimate, about 10 bytes are emitted from the Map phase for every int.

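To make the idea in the description concrete, here is a minimal sketch in Python of how value columns that duplicate a key column could be dropped before the shuffle and restored on the reduce side. It is only an illustration under stated assumptions, not the code in this patch: the function names `dedup_reduce_sink` and `rebuild_row` and the `("KEY" | "VALUE", index)` mapping are hypothetical and do not correspond to Hive classes or methods.

```python
def dedup_reduce_sink(key_cols, value_cols):
    """Drop value columns whose expression already appears in the key list.

    Hypothetical helper, not Hive's API. Returns (kept_values, sources):
    kept_values is the reduced value list to shuffle, and sources[i] says
    where original value column i can be read on the reduce side, either
    ("KEY", index into key_cols) or ("VALUE", index into kept_values).
    """
    # Position of each key column; the first occurrence wins.
    key_pos = {}
    for i, col in enumerate(key_cols):
        key_pos.setdefault(col, i)

    kept_values = []
    sources = []
    for col in value_cols:
        if col in key_pos:
            # Already shuffled as part of the key: do not emit it again.
            sources.append(("KEY", key_pos[col]))
        else:
            sources.append(("VALUE", len(kept_values)))
            kept_values.append(col)
    return kept_values, sources


def rebuild_row(key, value, sources):
    """Reassemble the original value row from the shuffled key and value."""
    return [key[i] if side == "KEY" else value[i] for side, i in sources]


# The plan above: _col0 is both the only key and the only value column.
kept, src = dedup_reduce_sink(["_col0"], ["_col0"])
print(kept, src)
print(rebuild_row((42,), (), src))
```

For the example plan, the value list becomes empty and the reducer reads `_col0` back from the key, so each row carries the int once instead of twice.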
Diffs
-----

  ql/src/java/org/apache/hadoop/hive/ql/Driver.java 9040d9b
  ql/src/java/org/apache/hadoop/hive/ql/exec/ColumnInfo.java acaca23
  ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java fc5864a
  ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 22374b2
  ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 6368548
  ql/src/java/org/apache/hadoop/hive/ql/exec/RowSchema.java 083d574
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7250432
  ql/src/java/org/apache/hadoop/hive/ql/hooks/LineageInfo.java 22a8785
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java 6a4dc9b
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java e3e0acc
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/CorrelationOptimizer.java 86e4834
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/correlation/ReduceSinkDeDuplication.java 719fe9f
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/lineage/ExprProcCtx.java 7cf48a7
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/lineage/ExprProcFactory.java b5cdde1
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/lineage/OpProcFactory.java 78b7ca8
  ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/BucketingSortingOpProcFactory.java eac0edd
  ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java f142f3e
  ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 49eb83f
  ql/src/java/org/apache/hadoop/hive/ql/ppd/ExprWalkerProcFactory.java 4175d11
  ql/src/java/org/apache/hadoop/hive/ql/session/LineageState.java e706f52

Diff: https://reviews.apache.org/r/21549/diff/


Testing
-------


Thanks,

Navis Ryu

--===============1511337113004233664==--