hive-issues mailing list archives

From "Remus Rusanu (JIRA)" <>
Subject [jira] [Updated] (HIVE-16757) Use of deprecated getRows() instead of new estimateRowCount(RelMetadataQuery..) has serious performance impact
Date Wed, 31 May 2017 05:40:04 GMT


Remus Rusanu updated HIVE-16757:
       Resolution: Fixed
    Fix Version/s: 3.0.0
           Status: Resolved  (was: Patch Available)

Resolved with commit 8aee8d4f2b124fcfa093724b4de0a54287a8084f

> Use of deprecated getRows() instead of new estimateRowCount(RelMetadataQuery..) has serious
performance impact
> --------------------------------------------------------------------------------------------------------------
>                 Key: HIVE-16757
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Remus Rusanu
>            Assignee: Remus Rusanu
>             Fix For: 3.0.0
>         Attachments: HIVE-16757.01.patch, HIVE-16757.02.patch, HIVE-16757.03.patch, HIVE-16757.04.patch,
HIVE-16757.05.patch, HIVE-16757.06.patch
> Calling Calcite's {{RelMetadataQuery.instance()}} is very expensive because it places
a new memoization cache on the stack. Hidden in the deprecated {{AbstractRelNode.getRows()}}
call is a call to {{instance()}}. In Hive there are a number of places where we call the
deprecated {{getRows()}} instead of the new API {{estimateRowCount(RelMetadataQuery mq)}},
which accepts the {{RelMetadataQuery}}; in most of those places we already have one at hand
to pass. Looking at a complex query (49 joins), there are 2995340 calls to {{AbstractRelNode.getRows}},
each one discarding the current memoization cache.
> Was: -On complex queries HiveRelMdRowCount.getRowCount can get called many times. Since
it does not memoize its result and the call is recursive, this results in an explosion of calls.
For example, for a query with 49 joins, during join ordering (LoptOptimizeJoinRule) HiveRelMdRowCount.getRowCount
is called 6442 times as a top-level call, but the recursion expands this to 501729 calls. Memoization
of the result would stop the recursion early. In my testing this reduced the join reordering
time for said query from 11s to <1s.-
> Note there is no need for {{HiveRelMdRowCount}} memoization because the function is called
in stacks similar to this:
> {code}
> 	at org.apache.hadoop.hive.ql.optimizer.calcite.stats.HiveRelMdRowCount.getRowCount(
> 	at GeneratedMetadataHandler_RowCount.getRowCount_$
> 	at GeneratedMetadataHandler_RowCount.getRowCount
> 	at org.apache.calcite.rel.metadata.RelMetadataQuery.getRowCount(
> 	at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.swapInputs(
> 	at org.apache.calcite.rel.rules.LoptOptimizeJoinRule.createJoinSubtree(
> {code}
> and {{GeneratedMetadataHandler_RowCount.getRowCount}} handles memoization.
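The cost difference described above can be illustrated with a small toy model. This is only a sketch and does not use Calcite or Hive: {{ToyQuery}}, {{ToyRel}}, {{ToyScan}}, {{ToyJoin}} and {{RowCountDemo}} are hypothetical names invented for the example, and the join estimate is an arbitrary formula chosen so that each join consults its inputs more than once. It assumes only what the description states: the deprecated-style call creates a fresh query (and so an empty cache) each time, while the newer call reuses the query it is given.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

// Hypothetical stand-in for a metadata query that owns a memoization cache.
final class ToyQuery {
    static long computations = 0; // counts row-count derivations that missed the cache
    private final Map<ToyRel, Double> cache = new HashMap<>();

    // Always returns a new query, i.e. a new, empty cache.
    static ToyQuery instance() { return new ToyQuery(); }

    double getRowCount(ToyRel rel, ToDoubleFunction<ToyRel> childRows) {
        Double cached = cache.get(rel);
        if (cached != null) return cached;
        computations++;
        double rows = rel.deriveRowCount(childRows);
        cache.put(rel, rows);
        return rows;
    }
}

abstract class ToyRel {
    // childRows tells the node how to obtain the row count of an input.
    abstract double deriveRowCount(ToDoubleFunction<ToyRel> childRows);

    // Deprecated style: every call starts from a fresh query, so nothing is reused.
    double getRows() {
        return ToyQuery.instance().getRowCount(this, ToyRel::getRows);
    }

    // Newer style: the caller's query, and therefore its cache, is passed down.
    double estimateRowCount(ToyQuery mq) {
        return mq.getRowCount(this, child -> child.estimateRowCount(mq));
    }
}

final class ToyScan extends ToyRel {
    private final double rows;
    ToyScan(double rows) { this.rows = rows; }
    @Override double deriveRowCount(ToDoubleFunction<ToyRel> childRows) { return rows; }
}

final class ToyJoin extends ToyRel {
    private final ToyRel left, right;
    ToyJoin(ToyRel left, ToyRel right) { this.left = left; this.right = right; }

    // Toy estimate only: asks each input for its row count twice.
    @Override double deriveRowCount(ToDoubleFunction<ToyRel> childRows) {
        double l = childRows.applyAsDouble(left);
        double r = childRows.applyAsDouble(right);
        return l * r / Math.max(childRows.applyAsDouble(left), childRows.applyAsDouble(right));
    }
}

final class RowCountDemo {
    // Builds a left-deep chain of joins over scans of 10 rows each.
    static ToyRel leftDeepJoins(int joins) {
        ToyRel root = new ToyScan(10);
        for (int i = 0; i < joins; i++) root = new ToyJoin(root, new ToyScan(10));
        return root;
    }

    static long countLegacy(ToyRel root) {
        ToyQuery.computations = 0;
        root.getRows();
        return ToyQuery.computations;
    }

    static long countShared(ToyRel root) {
        ToyQuery.computations = 0;
        root.estimateRowCount(ToyQuery.instance());
        return ToyQuery.computations;
    }

    public static void main(String[] args) {
        ToyRel root = leftDeepJoins(10);
        System.out.println("fresh query per call: " + countLegacy(root));
        System.out.println("one shared query:     " + countShared(root));
    }
}
```

In this model the number of derivations with a fresh query per call grows exponentially with the depth of the join chain, while with one shared query each node is derived once. The absolute numbers are specific to the toy and are not comparable to the figures reported above.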

This message was sent by Atlassian JIRA
