hive-dev mailing list archives

From "Namit Jain (Commented) (JIRA)" <>
Subject [jira] [Commented] (HIVE-2621) Allow multiple group bys with the same input data and spray keys to be run on the same reducer.
Date Thu, 22 Dec 2011 16:49:30 GMT


Namit Jain commented on HIVE-2621:

Let me take a look at the code again:

But the general flow should be as follows:

if hive.multigroupby.singlereducer is true (which should always be the case),
  find common distincts.
    (or the check of hive.multigroupby.singlereducer can be done inside the find common distincts
function itself)
  if common distincts == null
     old (current) approach - map side aggr should be used
  else
     new code path

What do you think? That way, we are guaranteed that the existing behavior is not changed.
This new parameter only affects distincts, and it is very easy to turn off.
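
A minimal sketch of the flow above, written in Python for brevity rather than in Hive's own Java. Everything except the parameter name hive.multigroupby.singlereducer is an assumption made for illustration: the plan labels, the subclause dictionaries, and the stand-in find_common_distincts are hypothetical and are not taken from the Hive code base or from the patch.

```python
from typing import Callable, Mapping, Optional, Sequence

# Hypothetical plan labels; these are not names from the Hive code base.
OLD_PLAN = "old (current) approach, map-side aggregation"
NEW_PLAN = "new single-reducer code path"


def find_common_distincts(subclauses: Sequence[dict]) -> Optional[list]:
    """Illustrative stand-in: return the distinct expressions if every
    subclause uses the same non-empty list, otherwise None."""
    if not subclauses:
        return None
    first = subclauses[0]["distincts"]
    if first and all(s["distincts"] == first for s in subclauses):
        return first
    return None


def choose_plan(
    conf: Mapping[str, bool],
    subclauses: Sequence[dict],
    finder: Callable[[Sequence[dict]], Optional[list]] = find_common_distincts,
) -> str:
    """Pick a plan following the proposed flow."""
    # Parameter turned off: existing behavior is left untouched.
    if not conf.get("hive.multigroupby.singlereducer", False):
        return OLD_PLAN
    # No common distincts: fall back to the old approach.
    if finder(subclauses) is None:
        return OLD_PLAN
    # Common distincts found: take the new code path.
    return NEW_PLAN
```

With the check placed first, turning the parameter off short-circuits before any new logic runs, which is what keeps the existing behavior unchanged.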

I know the code is kind of messy here, but can you spend some time to modularize it,
and reuse as much as possible ?

> Allow multiple group bys with the same input data and spray keys to be run on the same
> -----------------------------------------------------------------------------------------------
>                 Key: HIVE-2621
>                 URL:
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Kevin Wilfong
>            Assignee: Kevin Wilfong
>         Attachments: HIVE-2621.1.patch.txt, HIVE-2621.D567.1.patch, HIVE-2621.D567.2.patch,
> Currently, when a user runs a query, such as a multi-insert, where each insertion subclause
consists of a simple query followed by a group by, the group bys for each clause are run on
a separate reducer.  This requires writing the data for each group by clause to an intermediate
file, and then reading it back.  This uses a significant amount of the total CPU consumed
by the query for an otherwise simple query.
> If the subclauses are grouped by their distinct expressions and group by keys, with all
of the group by expressions for a group of subclauses run on a single reducer, this would
reduce the amount of reading/writing to intermediate files for some queries.
> To do this, for each group of subclauses, in the mapper we would execute the filters
for each subclause 'or'd together (provided each subclause has a filter) followed by a reduce
sink.  In the reducer, the child operators would be each subclause's filter followed by the
group by and any subsequent operations.
> Note that this would require turning off map aggregation, so we would need to make using
this type of plan configurable.
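
The grouping and filter steps in the description above can be sketched roughly as follows. This is an illustration in Python, not the patch itself: the Subclause record, its field names, and both helper functions are hypothetical, and filters are modeled as plain strings rather than Hive expression trees.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Subclause(NamedTuple):
    """Hypothetical model of one insertion subclause of a multi-insert."""
    name: str
    group_by_keys: Tuple[str, ...]
    distinct_exprs: Tuple[str, ...]
    filter_expr: Optional[str]  # None when the subclause has no filter


GroupKey = Tuple[Tuple[str, ...], Tuple[str, ...]]


def group_subclauses(subclauses: List[Subclause]) -> Dict[GroupKey, List[Subclause]]:
    """Group subclauses by (distinct expressions, group by keys); each
    group is a candidate for running on a single reducer."""
    groups: Dict[GroupKey, List[Subclause]] = defaultdict(list)
    for sc in subclauses:
        groups[(sc.distinct_exprs, sc.group_by_keys)].append(sc)
    return dict(groups)


def mapper_filter(group: List[Subclause]) -> Optional[str]:
    """OR the filters of a group together for use in the mapper.
    Returns None if any subclause lacks a filter, since that subclause
    needs every row and no map-side filter can be applied."""
    if any(sc.filter_expr is None for sc in group):
        return None
    return " OR ".join(f"({sc.filter_expr})" for sc in group)
```

In the reducer, each subclause would still apply its own filter_expr before its group by, because the OR'd mapper filter only removes rows that no subclause in the group needs.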
