hive-dev mailing list archives

From "Gunther Hagleitner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-2340) optimize orderby followed by a groupby
Date Mon, 04 Feb 2013 21:32:13 GMT

    [ https://issues.apache.org/jira/browse/HIVE-2340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570645#comment-13570645 ]

Gunther Hagleitner commented on HIVE-2340:
------------------------------------------

[~navis]: I think in general the logic should be to copy numReducers from parent to child,
not the other way around. If Hive makes a decent estimate of reducers for the parent, that's
probably the number you want to carry into the combined reduce stage, because it means each
reducer is doing the desired amount of work. Bucketing and order by are the only special cases
I can think of where the number needs to be fixed.
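
To make the direction of the copy concrete, here is a rough sketch, not Hive code. ReduceStage,
fixed, and merged_num_reducers are hypothetical names used only for illustration, and refusing
to merge when both stages are fixed at different counts is my assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReduceStage:
    """Hypothetical stand-in for a reduce stage; not a Hive class."""
    num_reducers: int
    fixed: bool = False  # True when bucketing or order by pins the count


def merged_num_reducers(parent: ReduceStage, child: ReduceStage) -> Optional[int]:
    """Pick the reducer count for the combined reduce stage.

    Default: carry the parent's estimate into the merged stage.
    Special case: a count fixed by bucketing / order by wins.
    Returns None when both counts are fixed and disagree (assumption:
    the stages are then left unmerged).
    """
    if parent.fixed and child.fixed and parent.num_reducers != child.num_reducers:
        return None
    if child.fixed:
        return child.num_reducers
    return parent.num_reducers
```

With a parent estimated at 12 reducers and an unconstrained child, the merged stage keeps 12;
a child pinned to a fixed count overrides that.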

For those special cases, without knowing the cardinalities of the join/group by/tables, it is
indeed difficult to guess whether the optimization should be on or off. However, what do you
think of using a max ratio of parent reducers to child reducers instead of a fixed minimum
number of reducers for the child? With a default of 4, maybe. I.e., if there are fewer than 4
times as many reducers in the parent as in the child, collapse (assuming another job will be
more expensive than running with the lower number of reducers); otherwise leave it alone. The
optimization is only good if the input sizes of the child and parent reducers are similar, and
expressing this as a ratio of the number of reducers is probably the closest we can get right now.
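
Roughly, the ratio check could look like the sketch below. This is illustrative only: the
function and parameter names are hypothetical, not existing Hive settings, and rejecting
non-positive reducer counts is my assumption.

```python
def should_collapse(parent_reducers: int, child_reducers: int,
                    max_ratio: float = 4.0) -> bool:
    """Decide whether to merge the child reduce stage into the parent.

    Collapse only when the parent has fewer than max_ratio times as many
    reducers as the child, i.e. when the reducer counts (used here as a
    proxy for reducer input sizes) are reasonably similar.
    """
    if parent_reducers <= 0 or child_reducers <= 0:
        # Assumption: both counts are already known and positive.
        raise ValueError("reducer counts must be positive")
    return parent_reducers / child_reducers < max_ratio
```

For example, a parent with 3 reducers feeding a child with 1 would be collapsed, while a parent
with 100 reducers feeding a child with 1 would be left alone.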

This would enable the optimization for a larger body of queries (small tables, single input
split, empty group by expr, etc.).
                
> optimize orderby followed by a groupby
> --------------------------------------
>
>                 Key: HIVE-2340
>                 URL: https://issues.apache.org/jira/browse/HIVE-2340
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Query Processor
>            Reporter: Navis
>            Assignee: Navis
>            Priority: Minor
>              Labels: perfomance
>         Attachments: ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.1.patch, ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.2.patch,
ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.3.patch, ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.4.patch,
ASF.LICENSE.NOT.GRANTED--HIVE-2340.D1209.5.patch, HIVE-2340.1.patch.txt, HIVE-2340.D1209.10.patch,
HIVE-2340.D1209.6.patch, HIVE-2340.D1209.7.patch, HIVE-2340.D1209.8.patch, HIVE-2340.D1209.9.patch,
testclidriver.txt
>
>
> Before implementing optimizer for JOIN-GBY, try to implement RS-GBY optimizer (cluster-by
> following group-by).

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
