hive-issues mailing list archives

From "Jesus Camacho Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-21690) Support outer joins with HiveAggregateJoinTransposeRule and turn it on by default
Date Mon, 06 May 2019 22:58:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834241#comment-16834241
] 

Jesus Camacho Rodriguez edited comment on HIVE-21690 at 5/6/19 10:57 PM:
-------------------------------------------------------------------------

{quote}
One approach to fix this is to localize the cost computation to the rule itself, i.e. compute
the non-cumulative cost of the existing aggregate and join and compare it with the cost of the new
aggregates, join and top aggregate.
A better approach in my opinion would be to fix the cost model and take aggregate cost into
account (along with the join). This could affect other queries and cause performance regressions,
but those will most likely be issues with the planning and should be investigated and fixed.
{quote}
In principle, the second approach seems the logical choice since it brings the cost model closer
to the actual execution cost. However, it can easily backfire with the current implementation of join
reordering costing, which only considers the cost of join operations and builds on that assumption.
Why should we consider only aggregate operators? What about other operators?
Before pushing such a change, I would argue that we need further evaluation of how it will
affect join reordering and of the regressions that we will get with respect to previous benchmarks.

Since the cost model is pluggable, have you thought about creating a cost model that extends the
default (join reordering) one with a cost calculation for the Aggregate operator? You could
use the new cost model when you trigger this rule. In a follow-up, you can study whether using
the same cost model for join reordering makes sense or not, and evaluate the merit of that
change for join reordering on its own.
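
To make the suggestion concrete, here is a minimal, self-contained sketch of a cost model that extends a join-only one with a cost for aggregates. It is not Hive or Calcite code: Node, CostModel, JoinOnlyCostModel, AggregateAwareCostModel, the "rows read" cost formula and all row counts are assumptions made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch only: none of these types are Hive or Calcite classes. */
final class AggregateAwareCostSketch {

    /** Hypothetical plan node: operator kind, estimated output rows, inputs. */
    static final class Node {
        final String kind;        // "SCAN", "JOIN" or "AGGREGATE"
        final double rows;        // estimated output row count (made-up numbers)
        final List<Node> inputs;

        Node(String kind, double rows, Node... inputs) {
            this.kind = kind;
            this.rows = rows;
            this.inputs = Arrays.asList(inputs);
        }

        /** Total rows this operator reads from its inputs. */
        double inputRows() {
            double sum = 0.0;
            for (Node in : inputs) {
                sum += in.rows;
            }
            return sum;
        }
    }

    /** Hypothetical pluggable cost model: non-cumulative cost of one operator. */
    interface CostModel {
        double selfCost(Node node);
    }

    /** Stand-in for the default model: only joins are charged (rows read). */
    static class JoinOnlyCostModel implements CostModel {
        @Override
        public double selfCost(Node node) {
            return node.kind.equals("JOIN") ? node.inputRows() : 0.0;
        }
    }

    /** Extension: aggregates are also charged for the rows they read. */
    static class AggregateAwareCostModel extends JoinOnlyCostModel {
        @Override
        public double selfCost(Node node) {
            if (node.kind.equals("AGGREGATE")) {
                return node.inputRows();
            }
            return super.selfCost(node);
        }
    }

    /** Sums the per-operator cost over the whole plan tree. */
    static double cumulativeCost(Node root, CostModel model) {
        double total = model.selfCost(root);
        for (Node in : root.inputs) {
            total += cumulativeCost(in, model);
        }
        return total;
    }

    /** Aggregate on top of a join of two 1000-row scans. */
    static Node originalPlan() {
        Node a = new Node("SCAN", 1000);
        Node b = new Node("SCAN", 1000);
        Node join = new Node("JOIN", 1000, a, b);
        return new Node("AGGREGATE", 100, join);
    }

    /** Same query with an aggregate pushed below the join that barely reduces rows. */
    static Node transposedPlan() {
        Node a = new Node("SCAN", 1000);
        Node b = new Node("SCAN", 1000);
        Node pushed = new Node("AGGREGATE", 900, a);
        Node join = new Node("JOIN", 900, pushed, b);
        return new Node("AGGREGATE", 100, join);
    }

    public static void main(String[] args) {
        CostModel joinOnly = new JoinOnlyCostModel();
        CostModel aggAware = new AggregateAwareCostModel();
        System.out.println("join-only:  original=" + cumulativeCost(originalPlan(), joinOnly)
                + " transposed=" + cumulativeCost(transposedPlan(), joinOnly));
        System.out.println("agg-aware:  original=" + cumulativeCost(originalPlan(), aggAware)
                + " transposed=" + cumulativeCost(transposedPlan(), aggAware));
    }
}
```

With these made-up numbers the join-only model prefers the transposed plan (1900 vs 2000), while the aggregate-aware model prefers the original plan (3000 vs 3800). Because the extension is a separate class, it could be used only when this rule fires, leaving join reordering on the join-only model.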



> Support outer joins with HiveAggregateJoinTransposeRule and turn it on by default
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-21690
>                 URL: https://issues.apache.org/jira/browse/HIVE-21690
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Planning
>            Reporter: Vineet Garg
>            Assignee: Vineet Garg
>            Priority: Major
>         Attachments: HIVE-21690.1.patch
>
>
> 1) This optimization is off by default. We would like to turn on this optimization, wherein
group by is pushed down below the join. In some cases the top aggregate is removed, but in most
cases this optimization adds extra aggregate nodes. To measure whether those extra aggregates are
beneficial or not (they might add overhead without reducing rows), cost is computed and
compared between the previous plan and the new plan.
> Since Hive's cost model only considers the JOIN's cost and discards the cost of the rest of the nodes,
this comparison always favors the new plan (since adding an aggregate beneath the join reduces the total
number of rows processed by the join and therefore reduces the join cost). Therefore turning
on this optimization with the existing cost model is not a good idea.
> One approach to fix this is to localize the cost computation to the rule itself, i.e.
compute the non-cumulative cost of the existing aggregate and join and compare it with the cost
of the new aggregates, join and top aggregate.
> A better approach in my opinion would be to fix the cost model and take aggregate cost
into account (along with the join). This could affect other queries and cause performance
regressions, but those will most likely be issues with the planning and should be investigated
and fixed.
> 2) This optimization currently only supports INNER JOIN. It can be extended to support
OUTER joins.
>  
> cc [~jcamachorodriguez] [~ashutoshc] [~gopalv]
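
The rule-local comparison described as the first approach in point 1 could look roughly like the sketch below. It is illustrative only: the class, the method names and the "rows read" cost formulas are assumptions, not the logic of HiveAggregateJoinTransposeRule, and the example assumes the aggregate is pushed to the left join input only.

```java
/** Sketch only: hypothetical helper, not part of Hive. */
final class LocalCostCheckSketch {

    /** Assumed cost of an aggregate: the rows it reads. */
    static double aggregateCost(double inputRows) {
        return inputRows;
    }

    /** Assumed cost of a join: the rows it reads from both sides. */
    static double joinCost(double leftRows, double rightRows) {
        return leftRows + rightRows;
    }

    /** Non-cumulative cost of the existing join plus the aggregate on top of it. */
    static double costBefore(double leftRows, double rightRows, double joinOutRows) {
        return joinCost(leftRows, rightRows) + aggregateCost(joinOutRows);
    }

    /** Cost of the new pushed aggregate, the new join and the top aggregate. */
    static double costAfter(double leftRows, double rightRows,
                            double pushedAggOutRows, double newJoinOutRows) {
        return aggregateCost(leftRows)
                + joinCost(pushedAggOutRows, rightRows)
                + aggregateCost(newJoinOutRows);
    }

    /** The rewrite is kept only if the new operators are cheaper in total. */
    static boolean shouldTranspose(double leftRows, double rightRows, double joinOutRows,
                                   double pushedAggOutRows, double newJoinOutRows) {
        return costAfter(leftRows, rightRows, pushedAggOutRows, newJoinOutRows)
                < costBefore(leftRows, rightRows, joinOutRows);
    }

    public static void main(String[] args) {
        // Pushed aggregate barely reduces rows (1000 -> 900).
        System.out.println(shouldTranspose(1000, 1000, 1000, 900, 900));
        // Pushed aggregate collapses a large input (1,000,000 -> 10).
        System.out.println(shouldTranspose(1_000_000, 1000, 1_000_000, 10, 10));
    }
}
```

Because only the operators touched by the rule are costed, this check does not change how the rest of the plan, including join reordering, is costed.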



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
