hive-issues mailing list archives

From "Bing Li (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (HIVE-17004) Calculating Number Of Reducers Looks At All Files
Date Sat, 01 Jul 2017 16:26:02 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bing Li reassigned HIVE-17004:
------------------------------

    Assignee: Bing Li

> Calculating Number Of Reducers Looks At All Files
> -------------------------------------------------
>
>                 Key: HIVE-17004
>                 URL: https://issues.apache.org/jira/browse/HIVE-17004
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>    Affects Versions: 2.1.1
>            Reporter: BELUGA BEHR
>            Assignee: Bing Li
>
> When calculating the number of Mappers and Reducers, the two algorithms look at different
> data sets. The number of Mappers is calculated from the number of input splits, while the
> number of Reducers is estimated from the total size of the data under the HDFS directory,
> gathered recursively. What you see is that if I add files to a sub-directory of the HDFS
> directory, the number of splits stays the same, since I did not tell Hive to search
> recursively, but the number of Reducers increases. Please improve this so that the Reducer
> estimate looks at the same files that are considered for splits, and not at files within
> sub-directories (unless configured to do so).
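> To make the mismatch concrete, here is a minimal, illustrative sketch (not the Hive implementation) built directly on the Hadoop FileSystem API: {{getContentSummary()}} counts bytes recursively, while a plain {{listStatus()}}, which is effectively what the split calculation uses unless recursion is enabled, sees only the top-level files.
> {code}
> // Illustrative only -- not Hive code. Shows why the two listings disagree.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.ContentSummary;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class ListingMismatch {
>   public static void main(String[] args) throws Exception {
>     Path dir = new Path("/user/admin/complaints");
>     FileSystem fs = dir.getFileSystem(new Configuration());
>
>     // Recursive: also counts /user/admin/complaints/t/* (feeds the Reducer estimate).
>     ContentSummary summary = fs.getContentSummary(dir);
>
>     // Non-recursive: top-level files only (what split calculation sees by default).
>     long topLevelBytes = 0;
>     for (FileStatus stat : fs.listStatus(dir)) {
>       if (stat.isFile()) {
>         topLevelBytes += stat.getLen();
>       }
>     }
>
>     System.out.println("recursive bytes = " + summary.getLength());
>     System.out.println("top-level bytes = " + topLevelBytes);
>   }
> }
> {code}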
> {code}
> CREATE EXTERNAL TABLE Complaints (
>   a string,
>   b string,
>   c string,
>   d string,
>   e string,
>   f string,
>   g string
> )
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> LOCATION '/user/admin/complaints';
> {code}
> {code}
> [root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
> {code}
> {code}
> INFO  : Compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
> INFO  : Completed compiling command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 0.077 seconds
> INFO  : Executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae): select a, count(1) from complaints group by a limit 10
> INFO  : Query ID = hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Number of reduce tasks not specified. Estimated from input data size: 11
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapreduce.job.reduces=<number>
> INFO  : number of splits:2
> INFO  : Submitting tokens for job: job_1493729203063_0003
> INFO  : The url to track the job: http://host:8088/proxy/application_1493729203063_0003/
> INFO  : Starting Job = job_1493729203063_0003, Tracking URL = http://host:8088/proxy/application_1493729203063_0003/
> INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job  -kill job_1493729203063_0003
> INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 11
> INFO  : 2017-05-02 14:20:14,206 Stage-1 map = 0%,  reduce = 0%
> INFO  : 2017-05-02 14:20:22,520 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.48 sec
> INFO  : 2017-05-02 14:20:34,029 Stage-1 map = 100%,  reduce = 27%, Cumulative CPU 15.72 sec
> INFO  : 2017-05-02 14:20:35,069 Stage-1 map = 100%,  reduce = 55%, Cumulative CPU 21.94 sec
> INFO  : 2017-05-02 14:20:36,110 Stage-1 map = 100%,  reduce = 64%, Cumulative CPU 23.97 sec
> INFO  : 2017-05-02 14:20:39,233 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 25.26 sec
> INFO  : 2017-05-02 14:20:43,392 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 30.9 sec
> INFO  : MapReduce Total cumulative CPU time: 30 seconds 900 msec
> INFO  : Ended Job = job_1493729203063_0003
> INFO  : MapReduce Jobs Launched: 
> INFO  : Stage-Stage-1: Map: 2  Reduce: 11   Cumulative CPU: 30.9 sec   HDFS Read: 735691149 HDFS Write: 153 SUCCESS
> INFO  : Total MapReduce CPU Time Spent: 30 seconds 900 msec
> INFO  : Completed executing command(queryId=hive_20170502142020_dfcf77ef-56b7-4544-ab90-6e9726ea86ae); Time taken: 36.035 seconds
> INFO  : OK
> {code}
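> (For reference, the estimate of 11 is consistent with a purely size-based calculation: the six top-level files total 6 x 122,607,137 = 735,642,822 bytes, and 735,642,822 / 67,108,864 is roughly 10.96, which rounds up to 11. This assumes hive.exec.reducers.bytes.per.reducer is 67,108,864 (64 MB) on this cluster, a value inferred from the observed numbers rather than confirmed from its configuration.)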
> {code}
> [root@host ~]# sudo -u hdfs hdfs dfs -ls -R /user/admin/complaints
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:12 /user/admin/complaints/Consumer_Complaints.csv
> drwxr-xr-x   - admin admin          0 2017-05-02 14:16 /user/admin/complaints/t
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.1.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.2.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.3.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.4.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.5.csv
> -rwxr-xr-x   2 admin admin  122607137 2017-05-02 14:16 /user/admin/complaints/t/Consumer_Complaints.csv
> {code}
> {code}
> INFO  : Compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
> INFO  : Semantic Analysis Completed
> INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a, type:string, comment:null), FieldSchema(name:_c1, type:bigint, comment:null)], properties:null)
> INFO  : Completed compiling command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 0.073 seconds
> INFO  : Executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e): select a, count(1) from complaints group by a limit 10
> INFO  : Query ID = hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e
> INFO  : Total jobs = 1
> INFO  : Launching Job 1 out of 1
> INFO  : Starting task [Stage-1:MAPRED] in serial mode
> INFO  : Number of reduce tasks not specified. Estimated from input data size: 22
> INFO  : In order to change the average load for a reducer (in bytes):
> INFO  :   set hive.exec.reducers.bytes.per.reducer=<number>
> INFO  : In order to limit the maximum number of reducers:
> INFO  :   set hive.exec.reducers.max=<number>
> INFO  : In order to set a constant number of reducers:
> INFO  :   set mapreduce.job.reduces=<number>
> INFO  : number of splits:2
> INFO  : Submitting tokens for job: job_1493729203063_0004
> INFO  : The url to track the job: http://host:8088/proxy/application_1493729203063_0004/
> INFO  : Starting Job = job_1493729203063_0004, Tracking URL = http://host:8088/proxy/application_1493729203063_0004/
> INFO  : Kill Command = /opt/cloudera/parcels/CDH-5.8.4-1.cdh5.8.4.p0.5/lib/hadoop/bin/hadoop job  -kill job_1493729203063_0004
> INFO  : Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 22
> INFO  : 2017-05-02 14:29:27,464 Stage-1 map = 0%,  reduce = 0%
> INFO  : 2017-05-02 14:29:36,829 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 10.2 sec
> INFO  : 2017-05-02 14:29:47,287 Stage-1 map = 100%,  reduce = 14%, Cumulative CPU 15.36 sec
> INFO  : 2017-05-02 14:29:49,381 Stage-1 map = 100%,  reduce = 27%, Cumulative CPU 20.76 sec
> INFO  : 2017-05-02 14:29:50,433 Stage-1 map = 100%,  reduce = 32%, Cumulative CPU 22.69 sec
> INFO  : 2017-05-02 14:29:56,743 Stage-1 map = 100%,  reduce = 45%, Cumulative CPU 27.73 sec
> INFO  : 2017-05-02 14:30:00,916 Stage-1 map = 100%,  reduce = 64%, Cumulative CPU 34.95 sec
> INFO  : 2017-05-02 14:30:06,142 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 41.49 sec
> INFO  : 2017-05-02 14:30:10,297 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 42.92 sec
> INFO  : 2017-05-02 14:30:11,334 Stage-1 map = 100%,  reduce = 86%, Cumulative CPU 45.24 sec
> INFO  : 2017-05-02 14:30:12,365 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 50.33 sec
> INFO  : MapReduce Total cumulative CPU time: 50 seconds 330 msec
> INFO  : Ended Job = job_1493729203063_0004
> INFO  : MapReduce Jobs Launched: 
> INFO  : Stage-Stage-1: Map: 2  Reduce: 22   Cumulative CPU: 50.33 sec   HDFS Read: 735731640 HDFS Write: 153 SUCCESS
> INFO  : Total MapReduce CPU Time Spent: 50 seconds 330 msec
> INFO  : Completed executing command(queryId=hive_20170502142929_66a476e5-0591-4abe-92b7-bd3e4973466e); Time taken: 51.841 seconds
> INFO  : OK
> {code}
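> (The same arithmetic explains the 22: with the six copies under /user/admin/complaints/t also counted, the recursive total doubles to 1,471,285,644 bytes, and 1,471,285,644 / 67,108,864 is roughly 21.92, which rounds up to 22, even though the split calculation still reads only the six top-level files.)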
> https://github.com/apache/hive/blob/bc510f63de9d6baab3a5ad8a4bf4eed9c6fde8b1/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L2959
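> The linked estimator boils down to dividing the total input size by the per-reducer byte target. A rough paraphrase (not the exact Hive method, which applies further adjustments):
> {code}
> import org.apache.hadoop.hive.conf.HiveConf;
>
> public class EstimateSketch {
>   // Paraphrase of the size-based Reducer estimate.
>   static int estimateReducers(HiveConf conf, long totalInputFileSize) {
>     // totalInputFileSize comes from a ContentSummary over the input paths;
>     // ContentSummary is recursive, so files in sub-directories are counted.
>     long bytesPerReducer = conf.getLongVar(HiveConf.ConfVars.BYTESPERREDUCER);
>     int maxReducers = conf.getIntVar(HiveConf.ConfVars.MAXREDUCERS);
>     int reducers = (int) ((totalInputFileSize + bytesPerReducer - 1) / bytesPerReducer);
>     return Math.max(1, Math.min(maxReducers, reducers));
>   }
> }
> {code}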
> The number of splits (Mappers) stays the same between the two runs, while the number of Reducers doubles.
> *INFO  : number of splits:2*
> # Number of reduce tasks not specified. Estimated from input data size: 11
> # Number of reduce tasks not specified. Estimated from input data size: 22
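> As a side note, setting mapred.input.dir.recursive=true (together with hive.mapred.supports.subdirectories=true) makes the split calculation descend into sub-directories as well, which should at least make the two numbers agree; the cleaner fix is for the Reducer estimate to honor the non-recursive default.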



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
