hive-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Markovitz, Dudu" <dmarkov...@paypal.com>
Subject RE: Query Performance Issue : Group By and Distinct and load on reducer
Date Tue, 28 Jun 2016 19:42:34 GMT
The row_number operation seems to be skewed.

Dudu

From: @Sanjiv Singh [mailto:sanjiv.is.on@gmail.com]
Sent: Tuesday, June 28, 2016 8:54 PM
To: user@hive.apache.org
Subject: Query Performance Issue : Group By and Distinct and load on reducer

Hi All,

I am having performance issue with data skew of the distinct statement in Hive<http://stackoverflow.com/questions/37894023/understanding-the-data-skew-of-the-countdistinct-statement-in-hive>.
See below query with DISTINCT operator.
Original Query :

SELECT DISTINCT
                 SD.REGION
                                ,SD.HEADEND
                                ,SD.NETWORK
                                ,SD.RETAILUNITCODE
                                ,SD.LOGTIMEDATE
                                ,SD.SPOTKEY
                                ,SD.CRE_DT
                                ,CASE
                                                WHEN SD.LOGTIMEDATE IS NULL
                                                                THEN 'Y'
                                                ELSE 'N'
                                                END AS DROP_REASON
                                ,ROW_NUMBER() OVER (
                                                ORDER BY NULL
                                                ) AS ETL_ROW_ID
                FROM INTER_ETL AS SD;

Table INTER_ETL used for query is big enough.
From the logs , it seems that data skew for specific set of values , causing one of reducer
have to do all the job. I tried to achieve the same through GROUP BY still having the same
issue.  Help me to understand the issue and resolution.
Query with Distinct V2 :

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT DISTINCT dt.*
FROM (
        SELECT SD.REGION
                    ,SD.HEADEND
                    ,SD.NETWORK
                    ,SD.RETAILUNITCODE
                    ,SD.LOGTIMEDATE
                    ,SD.SPOTKEY
                    ,SD.CRE_DT
                    ,CASE
                                WHEN SD.LOGTIMEDATE IS NULL
                                            THEN 'Y'
                                ELSE 'N'
                                END AS DROP_REASON
                    ,ROW_NUMBER() OVER (
                                ORDER BY NULL
                                ) AS ETL_ROW_ID
        FROM INTER_ETL AS SD
        ) AS dt;

Logs:

INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+56)/418
INFO  : Map 1: 107/107  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418


Query With Group By:

CREATE TEMPORARY TABLE ETL_TMP AS
SELECT REGION
                    ,HEADEND
                    ,NETWORK
                    ,RETAILUNITCODE
                    ,LOGTIMEDATE
                    ,SPOTKEY
                    ,CRE_DT
                    ,DROP_REASON
                    ,ETL_ROW_ID
FROM (
        SELECT SD.REGION
                    ,SD.HEADEND
                    ,SD.NETWORK
                    ,SD.RETAILUNITCODE
                    ,SD.LOGTIMEDATE
                    ,SD.SPOTKEY
                    ,SD.CRE_DT
                    ,CASE
                                WHEN SD.LOGTIMEDATE IS NULL
                                            THEN 'Y'
                                ELSE 'N'
                                END AS DROP_REASON
                    ,ROW_NUMBER() OVER (
                                ORDER BY NULL
                                ) AS ETL_ROW_ID
        FROM INTER_ETL AS SD
        ) AS dt
GROUP BY
         REGION
                    ,HEADEND
                    ,NETWORK
                    ,RETAILUNITCODE
                    ,LOGTIMEDATE
                    ,SPOTKEY
                    ,CRE_DT
                    ,DROP_REASON
                    ,ETL_ROW_ID;

Logs:

INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418
INFO  : Map 1: 818/818  Reducer 2: 417(+1)/418  Reducer 3: 0(+418)/418

Table details :

Beeline > dfs -ls /apps/hive/warehouse/PRD_DB.db/INTER_ETL ;
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
|                                                                 DFS Output             
                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
| Found 15 items                                                                         
                                                   |
| -rwxrwxrwx   3 Z56034 hdfs 2075244899 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000000_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 2090030620 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000001_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 2025516774 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000002_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1986848213 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000003_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 2018883723 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000004_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1984690335 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000005_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1987494444 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000006_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1974674515 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000007_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1963720218 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000008_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1965892384 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000009_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1974272622 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000010_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1971948208 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000011_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1968141886 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000012_0
 |
| -rwxrwxrwx   3 Z56034 hdfs 1970930771 2016-06-28 10:24 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000013_0
 |
| -rwxrwxrwx   3 Z56034 hdfs  192820628 2016-06-28 10:23 /apps/hive/warehouse/PRD_DB.db/INTER_ETL/000014_0
 |
+--------------------------------------------------------------------------------------------------------------------------------------------+--+
 'numFiles'='15',
 'numRows'='108363614',



Regards
Sanjiv Singh
Mob :  +091 9990-447-339
Mime
View raw message