hadoop-mapreduce-user mailing list archives

From: Koji Noguchi <knogu...@yahoo-inc.com>
Subject: Re: What are the conditions or which is the status of re-scheduling feature for failed attempts caused by dying a node?
Date: Tue, 01 Feb 2011 16:33:47 GMT
(Bcc CDH alias)

> Please don't cross-post, CDH questions should go to their user lists.
>
Was this CDH specific?

Did the job show up as failed on the jobtracker webui?
If yes, can you grep the jobtracker log for something like

2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201101040441_333049_r_000004 has failed 4 times.
2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201101040441_333049
2011-02-01 00:06:41,510 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201101040441_333049'

which tells you which task failure caused the job to fail.
Then you can look at the userlog of those task attempts to see why they failed.

Ideally this info should show up on the webui.
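
If grepping is awkward, roughly the same information can be pulled with the old mapred JobClient API. A minimal sketch (the job id is just the one from the log above, the class name is mine, and it assumes the cluster config is on the classpath):

  import org.apache.hadoop.mapred.*;

  public class FindFailedAttempts {
    public static void main(String[] args) throws Exception {
      JobClient client = new JobClient(new JobConf());
      RunningJob job = client.getJob(JobID.forName("job_201101040441_333049"));
      int from = 0;
      TaskCompletionEvent[] events;
      do {
        // Completion events come back in batches; keep paging until empty.
        events = job.getTaskCompletionEvents(from);
        for (TaskCompletionEvent ev : events) {
          if (ev.getTaskStatus() == TaskCompletionEvent.Status.FAILED) {
            // The tracker URL is where the userlogs for this attempt live.
            System.out.println(ev.getTaskAttemptId() + " failed on " + ev.getTaskTrackerHttp());
          }
        }
        from += events.length;
      } while (events.length > 0);
    }
  }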

On the other hand, if the job just hangs for hours, there's probably a bug in the framework.
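
On the question of whether this kind of situation can be controlled: the retry behavior is configurable per job. A minimal sketch with the old JobConf API (the values are only illustrative; the default is 4 attempts per task, which matches the "has failed 4 times" line above, and MyJobDriver is a placeholder for your driver class):

  // In the job driver, before submitting the job.
  JobConf conf = new JobConf(MyJobDriver.class);   // MyJobDriver is a placeholder
  conf.setMaxMapAttempts(8);                       // mapred.map.max.attempts (default 4)
  conf.setMaxReduceAttempts(8);                    // mapred.reduce.max.attempts (default 4)
  conf.setMaxTaskFailuresPerTracker(2);            // mapred.max.tracker.failures (default 4)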

Koji

On 1/31/11 9:36 PM, "Arun C Murthy" <acm@yahoo-inc.com> wrote:

Please don't cross-post, CDH questions should go to their user lists.

On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote:

Hi!

I was running a Hadoop cluster on Amazon EC2 instances, and after 2 days of work one of
the worker nodes simply died (I cannot connect to the instance either). That node also
appears on the dfshealth page as a dead node.
Up to this point everything is normal.

Unfortunately the job it was running didn't survive. The cluster had 8 worker nodes, each
with 4 mappers and 2 reducers. The job in question had ~1200 map tasks and 10 reduce tasks.
One of the nodes died, and I see around 31 failed attempts in the jobtracker log. The log
is very similar to the one somebody posted here: http://pastie.org/pastes/1270614

Some of the attempts (but not all!) have been retried, and I saw at least two of them
finally reach a successful state.
The following two lines appear several times in my jobtracker log:
2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
2011-01-29 15:50:34,956 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.

This pair of log lines could be the signal that the job could not be finished by re-scheduling
the failed attempts.
I have seen nothing special in the namenode logs.

Of course I reran the failed job, and it finished successfully. But my problem is that I would
like to understand the failover conditions: what can be lost, and which part of Hadoop is not
fault tolerant in this sense, such that the warnings mentioned earlier appear.
Is there a way to control this kind of situation?

I am using CDH3b3, which is a development version of Hadoop.
Does somebody know about a specific bug or fix that might solve this problem in the near future?

Regards
Tibor Kiss
