Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Thu, 21 Jul 2016 19:49:21 +0000 (UTC)
From: "Hari Sankar Sivarama Subramaniyan (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.12977778.1465598374000.99468.1469130561289@Atlassian.JIRA>
In-Reply-To: <JIRA.12977778.1465598374000@Atlassian.JIRA>
References: <JIRA.12977778.1465598374000@Atlassian.JIRA> <JIRA.12977778.1465598374632@arcas>
Subject: [jira] [Updated] (HIVE-13995) Hive generates inefficient metastore
 queries for TPCDS tables with 1800+ partitions leading to higher compile
 time
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Thu, 21 Jul 2016 19:49:28 -0000


     [ https://issues.apache.org/jira/browse/HIVE-13995?page=3Dcom.atlassia=
n.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hari Sankar Sivarama Subramaniyan updated HIVE-13995:
-----------------------------------------------------
    Status: Open  (was: Patch Available)

> Hive generates inefficient metastore queries for TPCDS tables with 1800+ =
partitions leading to higher compile time
> -------------------------------------------------------------------------=
-----------------------------------------
>
>                 Key: HIVE-13995
>                 URL: https://issues.apache.org/jira/browse/HIVE-13995
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: 2.2.0
>            Reporter: Nita Dembla
>            Assignee: Hari Sankar Sivarama Subramaniyan
>         Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch, HIVE-13995.3=
.patch, HIVE-13995.4.patch, HIVE-13995.5.patch, HIVE-13995.6.patch, HIVE-13=
995.7.patch
>
>
> TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and =
when the query does not a filter on the partition column, metastore queries=
 generated have a large IN clause listing all the partition names. Most RDB=
MS systems have issues optimizing large IN clause and even when a good inde=
x plan is chosen , comparing to 1800+ string values will not lead to best e=
xecution time.
> When all partitions are chosen, not specifying the partition list and hav=
ing filters only on table and column name will generate the same result set=
 as long as there are no concurrent modifications to partition list of the =
hive table (adding/dropping partitions).
> For eg: For TPCDS query18, the metastore query gathering partition column=
 statistics runs in 0.5 secs in Mysql. Following is output from mysql log
> {noformat}
> -- Query_time: 0.482063  Lock_time: 0.003037 Rows_sent: 1836  Rows_examin=
ed: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" =3D 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
=3D 'catalog_sales'=20
>  and "COLUMN_NAME" in=20
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_=
list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') and "PARTITIO=
N_NAME" in=20
> ('cs_sold_date_sk=3D2450815','cs_sold_date_sk=3D2450816','cs_sold_date_sk=
=3D2450817','cs_sold_date_sk=3D2450818','cs_sold_date_sk=3D2450819','cs_sol=
d_date_sk=3D2450820','cs_sold_date_sk=3D2450821','cs_sold_date_sk=3D2450822=
','cs_sold_date_sk=3D2450823','cs_sold_date_sk=3D2450824','cs_sold_date_sk=
=3D2450825','cs_sold_date_sk=3D2450826','cs_sold_date_sk=3D2450827','cs_sol=
d_date_sk=3D2450828','cs_sold_date_sk=3D2450829','cs_sold_date_sk=3D2450830=
','cs_sold_date_sk=3D2450831','cs_sold_date_sk=3D2450832','cs_sold_date_sk=
=3D2450833','cs_sold_date_sk=3D2450834','cs_sold_date_sk=3D2450835','cs_sol=
d_date_sk=3D2450836','cs_sold_date_sk=3D2450837','cs_sold_date_sk=3D2450838=
','cs_sold_date_sk=3D2450839','cs_sold_date_sk=3D2450840','cs_sold_date_sk=
=3D2450841','cs_sold_date_sk=3D2450842','cs_sold_date_sk=3D2450843','cs_sol=
d_date_sk=3D2450844','cs_sold_date_sk=3D2450845','cs_sold_date_sk=3D2450846=
','cs_sold_date_sk=3D2450847','cs_sold_date_sk=3D2450848','cs_sold_date_sk=
=3D2450849','cs_sold_date_sk=3D2450850','cs_sold_date_sk=3D2450851','cs_sol=
d_date_sk=3D2450852','cs_sold_date_sk=3D2450853','cs_sold_date_sk=3D2450854=
','cs_sold_date_sk=3D2450855','cs_sold_date_sk=3D2450856',...,'cs_sold_date=
_sk=3D2452654') group by "PARTITION_NAME";
> {noformat}
> Functionally equivalent query runs in 0.1 seconds
> {noformat}
> --Query_time: 0.121296  Lock_time: 0.000156 Rows_sent: 1836  Rows_examine=
d: 18360
> select count("COLUMN_NAME") from "PART_COL_STATS"
>  where "DB_NAME" =3D 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" =
=3D 'catalog_sales'  and "COLUMN_NAME" in=20
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_=
list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>  group by "PARTITION_NAME";
> {noformat}
> If removing the partition list seems drastic, its also possible to simply=
 list the range since hive gets a ordered list of partition names. This per=
forms equally well as earlier query
> {noformat}
> # Query_time: 0.143874  Lock_time: 0.000154 Rows_sent: 1836  Rows_examine=
d: 18360
> SET timestamp=3D1464014881;
> select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" =3D 'tp=
cds_bin_partitioned_orc_30000' and "TABLE_NAME" =3D 'catalog_sales'  and "C=
OLUMN_NAME" in=20
> ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_=
list_price','cs_sales_price','cs_coupon_amt','cs_net_profit')
>   and "PARTITION_NAME" >=3D 'cs_sold_date_sk=3D2450815' and "PARTITION_NA=
ME" <=3D 'cs_sold_date_sk=3D2452654'=20
> group by "PARTITION_NAME";
> {noformat}
> Another thing to check is the IN clause of column names. Columns in proje=
ction list of hive query are mentioned here. Not sure if statistics of thes=
e columns are required for hive query optimization.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)