hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: hadoop 0.17.1 reducer not fetching map output problem
Date Thu, 24 Jul 2008 18:39:46 GMT
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> Could you try to kill the tasktracker hosting the task the next time when
> it happens? I just want to isolate the problem - whether it is a problem in
> the TT-JT communication or in the Task-TT communication. From your
> description it looks like the problem is between the JT-TT communication.
> But pls run the experiment when it happens again and let us know what
> happens.

Well, I did restart the tasktracker where the reduce task was running, but that
led only to a situation where the jobtracker did not restart the task and still
showed it as running, and I was not able to kill the reduce task via hadoop
job -kill-task or -fail-task.

I hope to avoid a repeat: I'll be rolling our cluster back to 0.15 today. A peer
at another startup confirmed the whole batch of problems I've been
experiencing, and for him 0.15 works in production.

<rant-mode>
No question, 0.17 is way better than 0.16, but on the other hand I wonder how
0.16 could get released. (I'm using streaming.jar, and with 0.16.x I introduced
reducing to our workloads; before 0.17, 0.16 failed >80% of the jobs, with
reducers not being able to fetch the map output. 0.17.0 improved that to a point
where one can, with some pain (e.g. restarting the cluster daily, and storing
only temporary data, nothing important, on HDFS), use it somehow for
production, at least for small jobs.) So one wonders how 0.16 got released.
Or was it meant only as a developer-only bug-fixing series?
</rant-mode>

Sorry, this had been driving me up the wall and into an asylum until I compared
notes with a colleague and decided that I'm not crazy ;)

Andreas

>
> Thanks,
> Devaraj
>
> On 7/24/08 1:42 PM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:
> > Hi!
> >
> > I'm experiencing hung reducers, with the following symptoms:
> >> Task Logs: 'task_200807230647_0008_r_000009_1'
> >>
> >> stdout logs
> >>
> >> stderr logs
> >>
> >> syslog logs
> >>
> >> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
> >> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
> >> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
> >> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> >
> > Notice how it needs 6 map outputs, all map tasks have finished, and it
> > still just hangs there.
> >
> > The second speculative copy of that reducer task needs 14 map outputs
> > with the same messages :(
> >
> > Other observations:
> >
> > Killing the reduce task via job -kill-task just gets the task restarted
> > on the same node, and curiously the new attempt gets jammed at the same
> > position (6/14 maps needed).
> >
> > The only remedy to this problem seems to be a complete restart of the
> > cluster and reprocessing. That gets really boring with jobs that took a
> > day to process first :(
> >
> > Andreas
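
The hang signature in the quoted syslog (the same "Need N map output(s)" count on every poll, each followed by "Got 0 new map-outputs") can be checked for mechanically. The following is only a minimal sketch, assuming the ReduceTask message formats exactly as quoted above; the function name find_stalled_reducers and the min_polls threshold are hypothetical and not part of Hadoop.

```python
import re
from collections import defaultdict

# Matches the two ReduceTask messages of interest, as they appear in the
# quoted 0.17.1 syslog. Other Hadoop versions may word these differently.
EVENT_RE = re.compile(
    r"(?P<task>task_\w+):? (?:Need (?P<need>\d+) map output\(s\)"
    r"|Got (?P<new>\d+) new map-outputs)"
)


def find_stalled_reducers(log_text, min_polls=3):
    """Return {task_id: maps_needed} for reduce tasks that report the same
    'Need N' count with 0 new map-outputs on min_polls consecutive polls."""
    # Undo mail quoting ("> >> ") and line wrapping so each log entry can
    # match as one contiguous string.
    text = re.sub(r"(?m)^(?:>+ ?)+", "", log_text)
    text = " ".join(text.split())

    polls = defaultdict(list)  # task_id -> list of [needed, new_outputs]
    for m in EVENT_RE.finditer(text):
        task = m.group("task")
        if m.group("need") is not None:
            polls[task].append([int(m.group("need")), None])
        elif polls[task]:
            polls[task][-1][1] = int(m.group("new"))

    stalled = {}
    for task, entries in polls.items():
        run, last_need = 0, None
        for need, new in entries:
            if new == 0:
                # Extend the streak only if the needed count did not change.
                run = run + 1 if need == last_need else 1
            else:
                run = 0
            last_need = need
            if run >= min_polls:
                stalled[task] = need
    return stalled
```

Run against a saved copy of the reducer's syslog, a non-empty result names the task attempts that are polling without progress; it does not say whether the cause is on the jobtracker or tasktracker side.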


