hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Devaraj Das <d...@yahoo-inc.com>
Subject Re: hadoop 0.17.1 reducer not fetching map output problem
Date Thu, 24 Jul 2008 19:40:22 GMT



On 7/25/08 12:09 AM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:

> On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
>> Could you try to kill the tasktracker hosting the task the next time when
>> it happens? I just want to isolate the problem - whether it is a problem in
>> the TT-JT communication or in the Task-TT communication. From your
>> description it looks like the problem is between the JT-TT communication.
>> But pls run the experiment when it happens again and let us know what
>> happens.
> 
> Well, I did restart the tasktracker where the reduce job was running, but that
> lead only to a situation where the jobtracker did not restart the job, showed
> it as still running, and was not able to kill the reduce task via hadoop
> job -kill-task nor -fail-task.

The reduce task would eventually be reexecuted (after some timeout,
defaulting to 10 minutes, the tasktracker would be assumed as lost and all
reducers that were running on that node would be reexecuted).

> 
> I hope to avoid a repeat, I'll be relapsing out cluster to 0.15 today. A peer
> at another startup confirmed the whole batch of problems I've been
> experiencing, and for him 0.15 works for production.
> 
> <rant-mode>
> No question, 0.17 is way better than 0.16, on the other hand I wonder how 0.16
> could get released? (I'm using streaming.jar, and with 0.16.x I've introduced
> reducing to our workloads, and before 0.16 failed >80% of the jobs with
> reducers not being able to get their output. 0.17.0 improved that to a point
> where one can, with some pain, e.g. restarting the cluster daily, not storing
> anything important on HDFS, only temporary data, ..., use it somehow for
> production, at least for small jobs.) So one wonders how 0.16 got released?
> Or was it meant only as developer-only bug fixing series?
> </rant-mode>
>
Pls raise jiras for the specific problems.
 
> Sorry, this has been driving me up the walls into an asylum till I compared
> notes with a collegue, and decided that I'm not crazy ;)
> 
> Andreas
> 
>> 
>> Thanks,
>> Devaraj
>> 
>> On 7/24/08 1:42 PM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:
>>> Hi!
>>> 
>>> I'm experiencing hung reducers, with the following symptoms:
>>>> Task Logs: 'task_200807230647_0008_r_000009_1'
>>>> 
>>>> 
>>>> stdout logs
>>>> 
>>>> 
>>>> 
>>>> stderr logs
>>>> 
>>>> 
>>>> 
>>>> syslog logs
>>>> 
>>>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output
>>>> location(s); scheduling... 2008-07-24 07:56:11,064 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:16,074 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:21,084 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:26,094 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:31,104 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:36,114 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts) 2008-07-24
>>>> 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Need 6 map output(s) 2008-07-24
>>>> 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete
>>>> map-outputs from tasktracker and 0 map-outputs from previous failures
>>>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask:
>>>> task_200807230647_0008_r_000009_1 Got 0 known map output location(s);
>>>> scheduling... 2008-07-24 07:56:41,126 INFO
>>>> org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1
>>>> Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>>> 
>>> Notice how it needs 6 map outputs, all map tasks have finished, and it
>>> still just hangs there.
>>> 
>>> The second speculative copy of that reducer task needs 14 map outputs
>>> with the same messages :(
>>> 
>>> Other observations:
>>> 
>>> killing the reduce tasks via job -killtask ends up with restarting the
>>> job on the same node, and curiously the new job gets jammed at the same
>>> position (6/14 maps needed).
>>> 
>>> The only remedy to this problem seems to be a complete restart of the
>>> cluster and reprocessing. That gets really boring with jobs that took a
>>> day to process first :(
>>> 
>>> Andreas
> 
> 



Mime
View raw message