hive-issues mailing list archives

From "Rui Li (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17287) HoS can not deal with skewed data group by
Date Fri, 11 Aug 2017 08:36:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123036#comment-16123036 ]

Rui Li commented on HIVE-17287:
-------------------------------

OK, that looks like a skewed shuffle to me. You can run some statistics on the group key to confirm,
in case there's some issue like HIVE-17114.
Besides, what do the metrics look like if you enable {{hive.groupby.skewindata}}? That optimization
shuffles twice for the group by. The first shuffle is partitioned randomly. You can verify
it in the explain output:
{noformat}
                      Reduce Output Operator
                        key expressions: _col0 (type: string)
                        sort order: +
                        Map-reduce partition columns: rand() (type: double)
                        Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
{noformat}
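
To illustrate why the random first shuffle helps, here is a toy Python model of a two-stage group-by count. It is only a sketch of the general idea, not Hive's or Spark's actual code: the function name {{two_stage_group_count}}, the reducer count, and the use of CRC32 as the key hash are all assumptions made for the example.

```python
import random
import zlib
from collections import Counter


def two_stage_group_count(keys, num_reducers, seed=0):
    """Toy model of a two-shuffle group-by count (hypothetical helper).

    Returns the final counts plus the number of records each reducer
    received in stage 1 and in stage 2.
    """
    rng = random.Random(seed)

    # Stage 1: rows are spread over reducers at random, regardless of key,
    # and each reducer builds partial counts for the keys it happens to see.
    stage1_partials = [Counter() for _ in range(num_reducers)]
    stage1_load = [0] * num_reducers
    for key in keys:
        r = rng.randrange(num_reducers)
        stage1_partials[r][key] += 1
        stage1_load[r] += 1

    # Stage 2: the partial counts are shuffled by the group key, so every
    # key lands on exactly one reducer, which merges the partials.
    stage2_finals = [Counter() for _ in range(num_reducers)]
    stage2_load = [0] * num_reducers
    for partial in stage1_partials:
        for key, count in partial.items():
            r = zlib.crc32(key.encode("utf-8")) % num_reducers
            stage2_finals[r][key] += count
            stage2_load[r] += 1

    result = Counter()
    for final in stage2_finals:
        result.update(final)
    return result, stage1_load, stage2_load


if __name__ == "__main__":
    # One hot key dominates the input, similar to a skewed group key.
    rows = ["hot_key"] * 9000 + ["k%d" % i for i in range(1000)]
    counts, load1, load2 = two_stage_group_count(rows, num_reducers=4)
    print("stage 1 records per reducer:", load1)
    print("stage 2 records per reducer:", load2)
    print("count for hot_key:", counts["hot_key"])
```

In this model the hot key's rows are split roughly evenly across the stage 1 reducers, and the stage 2 reducer that owns the hot key receives at most one partial record per stage 1 reducer instead of every raw row.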

> HoS can not deal with skewed data group by
> ------------------------------------------
>
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: query67-fail-at-groupby.png, query67-groupby_shuffle_metric.png
>
>
> In [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
> the fact table {{store_sales}} joins with the small tables {{date_dim}}, {{item}}, and {{store}}. After
> the join, a group by is applied to the intermediate data.
> The data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 partitions.
> The biggest partition is 25.7G and the others are about 715M.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the job to fail. Is there any way to solve the
> group by problem for a skewed table? I tried enabling {{hive.groupby.skewindata}} to first divide
> the data more evenly and then do the group by, but the job still hangs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
