Date: Fri, 11 Sep 2015 00:51:46 +0000 (UTC)
From: "Sergey Shelukhin (JIRA)"
To: issues@hive.apache.org
Reply-To: dev@hive.apache.org
Subject: [jira] [Updated] (HIVE-11794) GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly

     [ https://issues.apache.org/jira/browse/HIVE-11794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HIVE-11794:
------------------------------------
    Description:
The code in Vectorizer is as such:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
then, if it's reduce side:
{noformat}
if (isMergePartial) {
  // Reduce Merge-Partial GROUP BY.
  // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
  // first (or root) operator for its reduce task.
  ....
} else {
  // Reduce Hash GROUP BY or global aggregation.
  ...
{noformat}

In fact, this logic is missing the COMPLETE mode. Both from the comment:
{noformat}
COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}
and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
{noformat}
Map Operator Tree:
...
  Group By Operator
    keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
    mode: hash
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Reduce Output Operator
...
feeding into
Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}
it seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if. For map-side, it doesn't check mode at all as far as I can see.
Not sure if additional code changes are necessary after that, it may just work.

  was:
The code in Vectorizer is as such:
{noformat}
    boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
{noformat}
then, if it's reduce side:
{noformat}
if (isMergePartial) {
  // Reduce Merge-Partial GROUP BY.
  // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
  // first (or root) operator for its reduce task.
  ....
} else {
  // Reduce Hash GROUP BY or global aggregation.
  ...
{noformat}

In fact, this logic is missing the COMPLETE mode. Both from the comment:
{noformat}
COMPLETE: complete 1-phase aggregation: iterate, terminate
...
HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
...
PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
{noformat}
and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
{noformat}
Map Operator Tree:
...
  Group By Operator
    keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
    mode: hash
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Statistics: Num rows: 273117 Data size: 22941828 Basic stats: COMPLETE Column stats: PARTIAL
    Reduce Output Operator
...
feeding into
Reduce Operator Tree:
  Group By Operator
    keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
    mode: mergepartial
    outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
    Group By Operator
      aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
      keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
      mode: complete
      outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
{noformat}
it seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if.
For map-side, it doesn't check mode at all as far as I can see.
Not sure if additional code changes are necessary after that, it may just work.


> GBY vectorization appears to process COMPLETE reduce-side GBY incorrectly
> -------------------------------------------------------------------------
>
>                 Key: HIVE-11794
>                 URL: https://issues.apache.org/jira/browse/HIVE-11794
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Matt McCline
>
> The code in Vectorizer is as such:
> {noformat}
>     boolean isMergePartial = (desc.getMode() != GroupByDesc.Mode.HASH);
> {noformat}
> then, if it's reduce side:
> {noformat}
> if (isMergePartial) {
>   // Reduce Merge-Partial GROUP BY.
>   // A merge-partial GROUP BY is fed by grouping by keys from reduce-shuffle. It is the
>   // first (or root) operator for its reduce task.
>   ....
> } else {
>   // Reduce Hash GROUP BY or global aggregation.
>   ...
> {noformat}
> In fact, this logic is missing the COMPLETE mode. Both from the comment:
> {noformat}
> COMPLETE: complete 1-phase aggregation: iterate, terminate
> ...
> HASH: For non-distinct the same as PARTIAL1 but use hash-table-based aggregation
> ...
> PARTIAL1: partial aggregation - first phase: iterate, terminatePartial
> {noformat}
> and from the explain plan like this (the query has multiple stages of aggregations over a union; the mapper does a partial hash aggregation for each side of the union, which is then followed by mergepartial, and 2nd stage as complete):
> {noformat}
> Map Operator Tree:
> ...
>   Group By Operator
>     keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int), _col5 (type: bigint), _col6 (type: bigint), _col7 (type: bigint), _col8 (type: bigint), _col9 (type: bigint), _col10 (type: bigint), _col11 (type: bigint), _col12 (type: bigint)
>     mode: hash
>     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>     Reduce Output Operator
> ...
> feeding into
> Reduce Operator Tree:
>   Group By Operator
>     keys: KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2 (type: int), KEY._col3 (type: int), KEY._col4 (type: int), KEY._col5 (type: bigint), KEY._col6 (type: bigint), KEY._col7 (type: bigint), KEY._col8 (type: bigint), KEY._col9 (type: bigint), KEY._col10 (type: bigint), KEY._col11 (type: bigint), KEY._col12 (type: bigint)
>     mode: mergepartial
>     outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
>     Group By Operator
>       aggregations: sum(_col5), sum(_col6), sum(_col7), sum(_col8), sum(_col9), sum(_col10), sum(_col11), sum(_col12)
>       keys: _col0 (type: int), _col1 (type: int), _col2 (type: int), _col3 (type: int), _col4 (type: int)
>       mode: complete
>       outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9, _col10, _col11, _col12
> {noformat}
> it seems like COMPLETE is actually the global aggregation, and HASH isn't (or may not be).
> So, it seems like reduce-side COMPLETE should be handled on the else-path of the above if. For map-side, it doesn't check mode at all as far as I can see.
> Not sure if additional code changes are necessary after that, it may just work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
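
The mode check discussed in the description can be illustrated with a small standalone sketch. This is not Hive code: the Mode enum below is a hypothetical stand-in that only mirrors the GroupByDesc.Mode names mentioned in the report (a subset), and ReducePath, currentPath and proposedPath are names invented for illustration. proposedPath encodes the reporter's suggestion, which the report itself says may or may not be sufficient.

```java
public class GroupByModeSketch {
    // Hypothetical stand-in for GroupByDesc.Mode; only the modes named in the report.
    public enum Mode { COMPLETE, PARTIAL1, HASH, MERGEPARTIAL }

    // Hypothetical labels for the two branches of the reduce-side if/else.
    public enum ReducePath { MERGE_PARTIAL, HASH_OR_GLOBAL }

    // Check as quoted in the report: every mode except HASH takes the merge-partial branch.
    public static ReducePath currentPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH);
        return isMergePartial ? ReducePath.MERGE_PARTIAL : ReducePath.HASH_OR_GLOBAL;
    }

    // Suggested behavior (an assumption, not a confirmed fix): COMPLETE also takes the else-path.
    public static ReducePath proposedPath(Mode mode) {
        boolean isMergePartial = (mode != Mode.HASH && mode != Mode.COMPLETE);
        return isMergePartial ? ReducePath.MERGE_PARTIAL : ReducePath.HASH_OR_GLOBAL;
    }

    public static void main(String[] args) {
        // Print both classifications side by side for each mode.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": current=" + currentPath(mode)
                    + ", proposed=" + proposedPath(mode));
        }
    }
}
```

With the quoted check, COMPLETE lands on the merge-partial branch; with the suggested change, only COMPLETE moves, and HASH and MERGEPARTIAL stay where they were.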