hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "@Sanjiv Singh" <sanjiv.is...@gmail.com>
Subject Re: Query Performance Issue : Group By and Distinct and load on reducer
Date Tue, 28 Jun 2016 19:56:41 GMT
Hi Dudu,

You are correct ...ROW_NUMBER() is main culprit.

ROW_NUMBER() OVER Not Fast Enough With Large Result Set, any good solution?



Regards
Sanjiv Singh
Mob :  +091 9990-447-339

On Tue, Jun 28, 2016 at 3:42 PM, Markovitz, Dudu <dmarkovitz@paypal.com>
wrote:

> The row_number operation seems to be skewed.
>
>
>
> Dudu
>
>
>
> *From:* @Sanjiv Singh [mailto:sanjiv.is.on@gmail.com]
> *Sent:* Tuesday, June 28, 2016 8:54 PM
> *To:* user@hive.apache.org
> *Subject:* Query Performance Issue : Group By and Distinct and load on
> reducer
>
>
>
> Hi All,
>
>
> I am having performance issue with data skew of the distinct statement in
> Hive
> <http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
> See below query with DISTINCT operator.
>
> *Original Query : *
>
>
>
> SELECT DISTINCT
>
>                  SD.REGION
>
>                                 ,SD.HEADEND
>                                 ,SD.NETWORK
>                                 ,SD.RETAILUNITCODE
>                                 ,SD.LOGTIMEDATE
>                                 ,SD.SPOTKEY
>                                 ,SD.CRE_DT
>                                 ,CASE
>                                                 WHEN SD.LOGTIMEDATE IS NULL
>                                                                 THEN 'Y'
>                                                 ELSE 'N'
>                                                 END AS DROP_REASON
>                                 ,ROW_NUMBER() OVER (
>                                                 ORDER BY NULL
>                                                 ) AS ETL_ROW_ID
>                 FROM INTER_ETL AS SD;
>
>
>
> Table *INTER_ETL *used for query is big enough.
> From the logs , it seems that data skew for specific set of values ,
> causing one of reducer have to do all the job. I tried to achieve the same
> through GROUP BY still having the same issue.  Help me to understand the
> issue and resolution.
>
> *Query with Distinct V2 :*
>
>
>
> CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT DISTINCT dt.*
> FROM (
>         SELECT SD.REGION
>                     ,SD.HEADEND
>                     ,SD.NETWORK
>                     ,SD.RETAILUNITCODE
>                     ,SD.LOGTIMEDATE
>                     ,SD.SPOTKEY
>                     ,SD.CRE_DT
>                     ,CASE
>                                 WHEN SD.LOGTIMEDATE IS NULL
>                                             THEN 'Y'
>                                 ELSE 'N'
>                                 END AS DROP_REASON
>                     ,ROW_NUMBER() OVER (
>                                 ORDER BY NULL
>                                 ) AS ETL_ROW_ID
>         FROM INTER_ETL AS SD
>         ) AS dt;
>
>
>
> Logs:
>
>
>
> INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
> INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
> INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
> INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
> INFO  : Map 1: 107/107  *Reducer 2: 417(+1)/418*  Reducer 3: 0(+418)/418
>
>
>
>
>
> *Query With Group By:*
>
>
>
> CREATE TEMPORARY TABLE ETL_TMP AS
> SELECT REGION
>                     ,HEADEND
>                     ,NETWORK
>                     ,RETAILUNITCODE
>                     ,LOGTIMEDATE
>                     ,SPOTKEY
>                     ,CRE_DT
>                     ,DROP_REASON
>                     ,ETL_ROW_ID
> FROM (
>         SELECT SD.REGION
>                     ,SD.HEADEND
>                     ,SD.NETWORK
>                     ,SD.RETAILUNITCODE
>                     ,SD.LOGTIMEDATE
>                     ,SD.SPOTKEY
>                     ,SD.CRE_DT
>                     ,CASE
>                                 WHEN SD.LOGTIMEDATE IS NULL
>                                             THEN 'Y'
>                                 ELSE 'N'
>                                 END AS DROP_REASON
>                     ,ROW_NUMBER() OVER (
>                                 ORDER BY NULL
>                                 ) AS ETL_ROW_ID
>         FROM INTER_ETL AS SD
>         ) AS dt
> GROUP BY
>          REGION
>                     ,HEADEND
>                     ,NETWORK
>                     ,RETAILUNITCODE
>                     ,LOGTIMEDATE
>                     ,SPOTKEY
>                     ,CRE_DT
>                     ,DROP_REASON
>                     ,ETL_ROW_ID;
>
>
>
> Logs:
>
>
>
> INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
> INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
> INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
> INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
> INFO  : Map 1: 818/818  *Reducer 2: 417(+1)/418*  Reducer 3: 0(+418)/418
>
>
>
> *Table details :*
>
>
>
> Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
>
>
> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>
> |                                                                 DFS
> Output                                                                 |
>
>
> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>
> | Found 15 items
>                                                                   |
>
> | -rwxrwxrwx   3 Z56034 hdfs 2075244899 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 2090030620 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 2025516774 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1986848213 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 2018883723 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1984690335 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1987494444 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1974674515 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1963720218 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1965892384 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1974272622 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1971948208 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1968141886 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs 1970930771 2016-06-28 10:24
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0  |
>
> | -rwxrwxrwx   3 Z56034 hdfs  192820628 2016-06-28 10:23
> /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0  |
>
>
> +--------------------------------------------------------------------------------------------------------------------------------------------+--+
>
>  'numFiles'='15',
>
>
>  'numRows'='108363614',
>
>
>
>
>
>
>
> Regards
> Sanjiv Singh
> Mob :  +091 9990-447-339
>

Mime
View raw message