hive-dev mailing list archives

From "Matt McCline (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-7405) Vectorize Reduce-Side GroupBy
Date Tue, 15 Jul 2014 00:09:04 GMT
Matt McCline created HIVE-7405:
----------------------------------

             Summary: Vectorize Reduce-Side GroupBy
                 Key: HIVE-7405
                 URL: https://issues.apache.org/jira/browse/HIVE-7405
             Project: Hive
          Issue Type: Bug
            Reporter: Matt McCline
            Assignee: Matt McCline



Take advantage of the fact that, in most plans, a reduce-side GroupBy receives its group keys
in sorted order, so aggregation can be done in a "streaming" fashion without buffering large
amounts of intermediate aggregation state in memory or storage.
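
As a rough illustration of the streaming idea (not Hive's actual implementation), the sketch
below sums values per group over rows that are assumed to arrive already sorted by group key.
The function name streaming_group_sum and the (key, value) row shape are hypothetical. Only
one running aggregate is held at a time, and a result is emitted whenever the key changes.

```python
from typing import Hashable, Iterable, Iterator, Tuple


def streaming_group_sum(
    rows: Iterable[Tuple[Hashable, int]],
) -> Iterator[Tuple[Hashable, int]]:
    """Sum values per key, assuming rows are already sorted by key.

    Hypothetical helper for illustration only; memory use is O(1) per
    aggregate because a group is finished as soon as the key changes.
    """
    current_key = None
    running = 0
    have_group = False  # distinguishes "no rows yet" from a None key

    for key, value in rows:
        if have_group and key != current_key:
            # Key changed: the previous group is complete, so emit it.
            yield current_key, running
            running = 0
        current_key = key
        have_group = True
        running += value

    if have_group:
        # Emit the final group once the input is exhausted.
        yield current_key, running


if __name__ == "__main__":
    sorted_rows = [("a", 1), ("a", 2), ("b", 5)]
    for group_key, total in streaming_group_sum(sorted_rows):
        print(group_key, total)
```

If the input were not sorted by key, the same key could appear in several separate runs and
this approach would emit it more than once, which is why the sorted-order guarantee matters.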

Push any case requiring large buffering -- e.g. COUNT(DISTINCT(..)) -- to part 2 of Vectorize
Reduce-Side GroupBy.  In theory, if there is only one COUNT(DISTINCT(..)), the optimizer could
arrange to sort on the distinct column(s) as a subordinate sort key and count the distinct
values within each group as a "streaming" operation.  Then only plans with multiple
COUNT(DISTINCT(..)) would require large buffering.
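
The single COUNT(DISTINCT(..)) case can be sketched the same way. The example below assumes
rows arrive sorted by (group key, distinct column), so duplicates of a value are adjacent
within a group and a distinct value can be counted each time the value changes. The name
streaming_count_distinct is hypothetical, and the sketch ignores SQL NULL semantics; it is
not taken from Hive.

```python
from typing import Hashable, Iterable, Iterator, Tuple


def streaming_count_distinct(
    rows: Iterable[Tuple[Hashable, Hashable]],
) -> Iterator[Tuple[Hashable, int]]:
    """Count distinct values per key without a hash set.

    Assumes rows are sorted by (key, value). Hypothetical helper for
    illustration only; NULL handling is intentionally left out.
    """
    current_key = None
    previous_value = None
    count = 0
    have_group = False  # True once the current group has seen a row

    for key, value in rows:
        if have_group and key != current_key:
            # Group boundary: emit the finished group and reset state.
            yield current_key, count
            count = 0
            have_group = False
        if not have_group or value != previous_value:
            # First row of a group, or a new value within the group.
            count += 1
        current_key = key
        previous_value = value
        have_group = True

    if have_group:
        yield current_key, count


if __name__ == "__main__":
    sorted_rows = [("a", 1), ("a", 1), ("a", 2), ("b", 2), ("b", 3)]
    for group_key, distinct_count in streaming_count_distinct(sorted_rows):
        print(group_key, distinct_count)
```

With two or more COUNT(DISTINCT(..)) expressions on different columns, a single sort order
cannot make duplicates adjacent for every column at once, which is the case the description
says would still need large buffering.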



--
This message was sent by Atlassian JIRA
(v6.2#6252)
