hadoop-mapreduce-user mailing list archives

From Ivan Leonardi <ivanleona...@gmail.com>
Subject Re: Reduce task progress at 100% but status is still RUNNING
Date Sat, 18 Dec 2010 14:42:00 GMT
Yes, I'm performing computation on large graphs (3.7M nodes) and have
set numReduceTasks to 60. The code is attached. At line 23 I'm using a
grouping function that computes the group ID of a given node ID: the
number of groups corresponds to the numReduceTasks parameter. Am I
doing it wrong, maybe? If I use e.g. 20 reduce tasks, I get Java heap
space errors, since a reduce task receives all the information about a
group (i.e. on the order of 3.7M/${groups_count} nodes).
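
For reference, the grouping I use for partitioning is roughly like the
following (a simplified sketch, not the attached code verbatim; the
modulo grouping and the class name are just illustrative):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: route each node to the reduce task that owns its group,
// where the number of groups equals numReduceTasks.
public class GroupPartitioner extends Partitioner<LongWritable, Text> {
    @Override
    public int getPartition(LongWritable nodeId, Text value, int numPartitions) {
        // group ID of a node = nodeId modulo the number of groups (illustrative)
        return (int) (nodeId.get() % numPartitions);
    }
}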

2010/12/18 Hari Sreekumar <hsreekumar@clickable.com>:
> Hi Ivan,
>       There are two possibilities. The first is that you are doing some
> very heavy calculation in the reduce phase which is taking too long. In
> this case, you can try increasing the timeout with these config params:
>
> mapreduce.task.timeout
>
> mapreduce.tasktracker.healthchecker.script.timeout
>
> mapreduce.jobtracker.expire.trackers.interval
>
> The default is 600000 ms
> (http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/java/mapred-default.xml)
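>
> For example (just a sketch; the timeout value below is arbitrary), you
> can raise it in the driver before submitting the job:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.mapreduce.Job;
>
>   Configuration conf = new Configuration();
>   // 30 minutes instead of the default 10; on 0.20.x the key is
>   // "mapred.task.timeout", "mapreduce.task.timeout" is the newer name
>   conf.setLong("mapred.task.timeout", 1800000L);
>   Job job = new Job(conf, "graph-job");  // job name is illustrative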
>
> The second possibility, which I think is more likely, is that you have a
> bug in your reduce code which is making it take too long. Maybe an
> infinite loop or some bad code? What are you trying to do in the reduce
> phase? Can you post some code?
>
> Thanks,
>
> Hari
>
> On Sat, Dec 18, 2010 at 7:30 PM, Harsh J <qwertymaniac@gmail.com> wrote:
>>
>> Are you doing much in your reducer? Try reporting a string back at
>> various progress points of your reducer functionality (like a status
>> update).
>>
>> A task would time-out if it has not {written anything / read anything
>> / reported anything about its status}.
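>>
>> Something along these lines inside reduce() keeps the attempt alive
>> (a sketch assuming the new org.apache.hadoop.mapreduce API; key/value
>> types and the reporting interval are just illustrative):
>>
>>   @Override
>>   protected void reduce(LongWritable key, Iterable<Text> values,
>>       Context context) throws IOException, InterruptedException {
>>     long seen = 0;
>>     for (Text value : values) {
>>       // ... your expensive per-value work on 'value' ...
>>       if (++seen % 10000 == 0) {
>>         context.setStatus("key " + key + ": " + seen + " values done");
>>         context.progress();  // tell the framework the task is alive
>>       }
>>     }
>>   }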
>>
>> On Sat, Dec 18, 2010 at 7:24 PM, Ivan Leonardi <ivanleonardi@gmail.com>
>> wrote:
>> > Hi all,
>> > I'm running Hadoop 0.20.2 on a cluster of 4 machines, and I'm
>> > trying to run my own job. Here is my configuration:
>> >
>> > core-site.xml
>> > <configuration>
>> > <property>
>> > <name>fs.default.name</name><value>hdfs://net-server00:44310</value>
>> > </property>
>> > <property>
>> > <name>hadoop.tmp.dir</name><value>/tmp/hadoop-leonardi</value>
>> > </property>
>> > </configuration>
>> >
>> > hdfs-site.xml
>> > <configuration>
>> > <property>
>> > <name>dfs.replication</name><value>1</value>
>> > </property>
>> > </configuration>
>> >
>> > mapred-site.xml
>> > <configuration>
>> > <property>
>> > <name>mapred.job.tracker</name><value>net-server00:44311</value>
>> > </property>
>> > <property>
>> > <name>mapred.tasktracker.map.tasks.maximum</name><value>3</value>
>> > </property>
>> > <property>
>> > <name>mapred.tasktracker.reduce.tasks.maximum</name><value>3</value>
>> > </property>
>> > <property>
>> > <name>tasktracker.http.threads</name><value>60</value>
>> > </property>
>> > <property>
>> > <name>mapred.reduce.parallel.copies</name><value>50</value>
>> > </property>
>> > </configuration>
>> >
>> > My job runs normally at the beginning:
>> > 10/12/18 14:16:36 INFO input.FileInputFormat: Total input paths to
>> > process : 30
>> > 10/12/18 14:16:36 INFO mapred.JobClient: Running job:
>> > job_201012181359_0004
>> > 10/12/18 14:16:37 INFO mapred.JobClient:  map 0% reduce 0%
>> > 10/12/18 14:16:51 INFO mapred.JobClient:  map 20% reduce 0%
>> > 10/12/18 14:16:54 INFO mapred.JobClient:  map 40% reduce 0%
>> > 10/12/18 14:17:00 INFO mapred.JobClient:  map 60% reduce 0%
>> > 10/12/18 14:17:03 INFO mapred.JobClient:  map 80% reduce 2%
>> > 10/12/18 14:17:06 INFO mapred.JobClient:  map 80% reduce 5%
>> > 10/12/18 14:17:09 INFO mapred.JobClient:  map 90% reduce 7%
>> > 10/12/18 14:17:12 INFO mapred.JobClient:  map 100% reduce 9%
>> > 10/12/18 14:17:15 INFO mapred.JobClient:  map 100% reduce 10%
>> > 10/12/18 14:17:18 INFO mapred.JobClient:  map 100% reduce 11%
>> > 10/12/18 14:17:19 INFO mapred.JobClient:  map 100% reduce 13%
>> > 10/12/18 14:17:21 INFO mapred.JobClient:  map 100% reduce 21%
>> > 10/12/18 14:17:22 INFO mapred.JobClient:  map 100% reduce 23%
>> > 10/12/18 14:17:24 INFO mapred.JobClient:  map 100% reduce 28%
>> > 10/12/18 14:17:27 INFO mapred.JobClient:  map 100% reduce 33%
>> > 10/12/18 14:17:29 INFO mapred.JobClient:  map 100% reduce 35%
>> > 10/12/18 14:17:31 INFO mapred.JobClient:  map 100% reduce 38%
>> > 10/12/18 14:17:34 INFO mapred.JobClient:  map 100% reduce 40%
>> >
>> > At this point, the percentages stay stuck for 10 minutes (600 secs).
>> > In the meanwhile, looking at each machine's task tracker web page, I
>> > see 3 running reduce tasks with Status=RUNNING and Progress=100%;
>> > here is a sample:
>> > TaskAttempts                           Status    Progress
>> > attempt_201012181359_0004_r_000001_0   RUNNING   100.00%
>> > attempt_201012181359_0004_r_000003_0   RUNNING   100.00%
>> > attempt_201012181359_0004_r_000006_0   RUNNING   100.00%
>> >
>> > Then, after those 600 secs, the console resumes logging and emits
>> > the following:
>> > 10/12/18 14:27:34 INFO mapred.JobClient:  map 100% reduce 36%
>> > 10/12/18 14:27:36 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000006_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000006_0 failed to report status for
>> > 602 seconds. Killing!
>> > 10/12/18 14:27:38 INFO mapred.JobClient:  map 100% reduce 23%
>> > 10/12/18 14:27:40 INFO mapred.JobClient:  map 100% reduce 6%
>> > 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000007_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000007_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:40 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000010_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000010_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:41 INFO mapred.JobClient:  map 100% reduce 0%
>> > 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000000_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000000_0 failed to report status for
>> > 600 seconds. Killing!
>> > 10/12/18 14:27:42 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000009_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000009_0 failed to report status for
>> > 600 seconds. Killing!
>> > 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000001_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000001_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:43 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000003_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000003_0 failed to report status for
>> > 602 seconds. Killing!
>> > 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000004_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000004_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:45 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000008_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000008_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:46 INFO mapred.JobClient:  map 100% reduce 1%
>> > 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000011_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000011_0 failed to report status for
>> > 601 seconds. Killing!
>> > 10/12/18 14:27:46 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000002_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000002_0 failed to report status for
>> > 602 seconds. Killing!
>> > 10/12/18 14:27:48 INFO mapred.JobClient: Task Id :
>> > attempt_201012181359_0004_r_000005_0, Status : FAILED
>> > Task attempt_201012181359_0004_r_000005_0 failed to report status for
>> > 602 seconds. Killing!
>> > 10/12/18 14:27:49 INFO mapred.JobClient:  map 100% reduce 2%
>> > 10/12/18 14:27:55 INFO mapred.JobClient:  map 100% reduce 4%
>> > 10/12/18 14:27:56 INFO mapred.JobClient:  map 100% reduce 6%
>> > 10/12/18 14:27:58 INFO mapred.JobClient:  map 100% reduce 9%
>> > 10/12/18 14:28:00 INFO mapred.JobClient:  map 100% reduce 14%
>> > 10/12/18 14:28:02 INFO mapred.JobClient:  map 100% reduce 22%
>> > 10/12/18 14:28:03 INFO mapred.JobClient:  map 100% reduce 25%
>> > 10/12/18 14:28:05 INFO mapred.JobClient:  map 100% reduce 31%
>> > 10/12/18 14:28:06 INFO mapred.JobClient:  map 100% reduce 33%
>> > 10/12/18 14:28:08 INFO mapred.JobClient:  map 100% reduce 36%
>> > 10/12/18 14:28:09 INFO mapred.JobClient:  map 100% reduce 37%
>> > 10/12/18 14:28:11 INFO mapred.JobClient:  map 100% reduce 38%
>> > 10/12/18 14:28:12 INFO mapred.JobClient:  map 100% reduce 39%
>> > 10/12/18 14:28:14 INFO mapred.JobClient:  map 100% reduce 40%
>> >
>> > and such behaviour continues.
>> >
>> > Has anyone had the same problem and can help me?
>> > Thank you all!
>> >
>> > Ivan
>> >
>>
>>
>>
>> --
>> Harsh J
>> www.harshj.com
>
>
