hive-issues mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-15339) Prefetch column stats for fields needed in FilterSelectivityEstimator
Date Fri, 02 Dec 2016 11:12:59 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714800#comment-15714800 ]

Rajesh Balamohan edited comment on HIVE-15339 at 12/2/16 11:12 AM:
-------------------------------------------------------------------

Attaching .1 patch.
{noformat}

Without any patch: (compile time 14.2 seconds)
==============
2016-12-02T05:20:34,867 DEBUG [cf8155ce-cf85-41b5-b0a3-4d4a6c75da5e main] log.PerfLogger:
</PERFLOG method=compile start=1480656020666 end=1480656034867 duration=14201 from=org.apache.hadoop.hive.ql.Driver>


With Patch: (compile time 10.6 seconds)
=========
2016-12-02T05:34:53,820 DEBUG [bfe87e40-4260-4f67-9e84-cd89694be1ad main] log.PerfLogger:
</PERFLOG method=compile start=1480656883196 end=1480656893820 duration=10624 from=org.apache.hadoop.hive.ql.Driver>

{noformat}
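
To make the comparison explicit, here is a minimal Python sketch that parses the two PerfLogger lines quoted above and computes the saving. The regular expression and the helper name `parse_perflog` are illustrative assumptions, not part of Hive; only the two log lines come from the runs above.

```python
import re

# The two PerfLogger lines from the runs above, copied verbatim.
WITHOUT_PATCH = ("</PERFLOG method=compile start=1480656020666 "
                 "end=1480656034867 duration=14201 "
                 "from=org.apache.hadoop.hive.ql.Driver>")
WITH_PATCH = ("</PERFLOG method=compile start=1480656883196 "
              "end=1480656893820 duration=10624 "
              "from=org.apache.hadoop.hive.ql.Driver>")

# Hypothetical pattern for this one log shape; not a Hive-provided parser.
PERFLOG_RE = re.compile(
    r"</PERFLOG method=(\w+) start=(\d+) end=(\d+) duration=(\d+) from=([\w.]+)>"
)


def parse_perflog(line):
    """Extract method name and millisecond timings from a PERFLOG end line."""
    match = PERFLOG_RE.search(line)
    if match is None:
        raise ValueError("not a PERFLOG end line: %r" % line)
    method, start, end, duration, _source = match.groups()
    return {
        "method": method,
        "start": int(start),
        "end": int(end),
        "duration": int(duration),
    }


before = parse_perflog(WITHOUT_PATCH)
after = parse_perflog(WITH_PATCH)

# Sanity check: the logged duration equals end - start in both runs.
assert before["end"] - before["start"] == before["duration"]
assert after["end"] - after["start"] == after["duration"]

saved_ms = before["duration"] - after["duration"]
saved_pct = 100.0 * saved_ms / before["duration"]
print("saved %d ms (%.1f%% of compile time)" % (saved_ms, saved_pct))
```

This is a single pair of runs, so the figure is an indication rather than a benchmark.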

The metastore DB was hosted in Postgres, and the flights table has around 7000 partitions.
"Prefetch" is the wrong term in the jira title. The patch sends all the needed columns in the
same call, and on the other side these column stats get cached in AggregateColStats. Any
column stats call fired later fetches the data from the cache itself, which makes it faster.
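
The idea can be illustrated with a small Python sketch. It is a toy model, not the patch itself: `FakeMetastore` and `CachingStatsClient` are hypothetical names, a plain dict stands in for the AggregateColStats cache, and the stats values are placeholders. The column names are the `flights` columns referenced in the query from the issue description.

```python
class FakeMetastore:
    """Stand-in for a remote metastore that counts round trips."""

    def __init__(self, stats):
        self._stats = stats
        self.calls = 0

    def get_column_stats(self, table, columns):
        # Each invocation models one remote call, however many columns it asks for.
        self.calls += 1
        return {col: self._stats[(table, col)] for col in columns}


class CachingStatsClient:
    """Fetches missing columns in one call and serves repeats from a cache."""

    def __init__(self, metastore):
        self._metastore = metastore
        self._cache = {}  # plain dict standing in for the stats cache

    def get_column_stats(self, table, columns):
        missing = [col for col in columns if (table, col) not in self._cache]
        if missing:
            # One remote call covers every column not yet cached.
            fetched = self._metastore.get_column_stats(table, missing)
            for col, col_stats in fetched.items():
                self._cache[(table, col)] = col_stats
        return {col: self._cache[(table, col)] for col in columns}


# Columns of `flights` used by the query in the issue description.
FLIGHTS_COLUMNS = ["flightnum", "dateofflight", "uniquecarrier", "origin", "dest"]
# Placeholder stats objects; real column statistics are not modeled here.
stats = {("flights", col): {"column": col} for col in FLIGHTS_COLUMNS}

# Per-column pattern: one remote call for each column.
one_by_one = FakeMetastore(stats)
for col in FLIGHTS_COLUMNS:
    one_by_one.get_column_stats("flights", [col])

# Batched pattern: one call up front, later lookups are cache hits.
batched = FakeMetastore(stats)
client = CachingStatsClient(batched)
client.get_column_stats("flights", FLIGHTS_COLUMNS)
for col in FLIGHTS_COLUMNS:
    client.get_column_stats("flights", [col])

print("per-column calls:", one_by_one.calls)
print("batched calls:", batched.calls)
```

In this model the per-column pattern makes five remote calls and the batched pattern makes one. With around 7000 partitions behind each call, the number of round trips is what drives the cost.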



\cc [~pxiong], [~ashutoshc], [~jcamachorodriguez]




> Prefetch column stats for fields needed in FilterSelectivityEstimator
> ---------------------------------------------------------------------
>
>                 Key: HIVE-15339
>                 URL: https://issues.apache.org/jira/browse/HIVE-15339
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Minor
>         Attachments: HIVE-15339.1.patch
>
>
> Based on the query pattern, {{FilterSelectivityEstimator}} fetches column statistics from the
metastore in multiple calls. For instance, for the query below, it ends up fetching individual
column statistics for flights multiple times.
> When the table has a large number of partitions, fetching column statistics via multiple
calls can be very expensive and adversely impacts the overall compilation time. The
following query took 14 seconds to compile.
> {noformat}
> SELECT COUNT(`flights`.`flightnum`) AS `cnt_flightnum_ok`,
> YEAR(`flights`.`dateofflight`) AS `yr_flightdate_ok`
> FROM `flights` as `flights`
> JOIN `airlines` ON (`flights`.`uniquecarrier` = `airlines`.`code`)
> JOIN `airports` as `source_airport` ON (`flights`.`origin` = `source_airport`.`iata`)
> JOIN `airports` as `dest_airport` ON (`flights`.`dest` = `dest_airport`.`iata`)
> GROUP BY YEAR(`flights`.`dateofflight`);
> {noformat}
> It may be helpful to group all columns that need statistics and fetch these details in
a single remote call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
