hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hari Sankar Sivarama Subramaniyan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-13995) Hive generates inefficient metastore queries for TPCDS tables with 1800+ partitions leading to higher compile time
Date Tue, 19 Jul 2016 21:20:20 GMT

     [ https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hari Sankar Sivarama Subramaniyan updated HIVE-13995:
-----------------------------------------------------
    Attachment: HIVE-13995.5.patch

> Hive generates inefficient metastore queries for TPCDS tables with 1800+ partitions leading
to higher compile time
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13995
>                 URL: https://issues.apache.org/jira/browse/HIVE-13995
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.2.0
>            Reporter: Nita Dembla
>            Assignee: Hari Sankar Sivarama Subramaniyan
>         Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch, HIVE-13995.3.patch, HIVE-13995.4.patch,
HIVE-13995.5.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when the query
does not a filter on the partition column, metastore queries generated have a large IN clause
listing all the partition names. Most RDBMS systems have issues optimizing large IN clause
and even when a good index plan is chosen , comparing to 1800+ string values will not lead
to best execution time.
> When all partitions are chosen, not specifying the partition list and having filters
only on table and column name will generate the same result set as long as there are no concurrent
modifications to partition list of the hive table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column statistics
runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063  Lock_time: 0.003037 Rows_sent: 1836  Rows_examined: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'

>  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
and "PARTITION_NAME" in 
> ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654')
group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296  Lock_time: 0.000156 Rows_sent: 1836  Rows_examined: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales'
 and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>  group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply list the range
since hive gets a ordered list of partition names. This performs equally well as earlier query
> {noformat}
> # Query_time: 0.143874  Lock_time: 0.000154 Rows_sent: 1836  Rows_examined: 18360
> SET timestamp=1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000'
and "TABLE_NAME" = 'catalog_sales'  and "COLUMN_NAME" in 
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>   and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= 'cs_sold_date_sk=2452654'

> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in projection list of
hive query are mentioned here. Not sure if statistics of these columns are required for hive
query optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message