hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3531) Hod does not report job tracker failure on hod client side when job tracker fails to come up
Date Wed, 11 Jun 2008 16:52:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12604270#action_12604270 ]

Hemanth Yamijala commented on HADOOP-3531:
------------------------------------------

Debugging this issue, I found the cause to be a timing problem that does not happen on all
machines.

In the HodRing, we have code that determines whether a launched Hadoop command has exited with
a non-zero error code, and in such cases an error is reported back to the ringmaster. The
check is made soon after the command is launched. On some machines, the interval between
launching the command and its exit with the error code is a few hundred milliseconds. On such
machines, the check runs before the command has exited, so the code concludes that all is fine.
It fails only later, when it tries to check whether the JobTracker's Jetty server is up. In the
process it loses about a minute.
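The race can be reproduced outside HOD. The following is a minimal sketch, not the HodRing code: the failing command is a hypothetical stand-in for a Hadoop command that dies about 300 ms after launch, and `exited_with_error` and `demo_race` are names made up for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a Hadoop command that fails ~300 ms after launch.
FAILING_CMD = [
    sys.executable, "-c",
    "import sys, time; time.sleep(0.3); sys.exit(1)",
]


def exited_with_error(proc):
    """Return True only if the process has already exited with a non-zero code."""
    code = proc.poll()  # None while the process is still running
    return code is not None and code != 0


def demo_race():
    """Check the exit status immediately after launch and again after exit."""
    proc = subprocess.Popen(FAILING_CMD)
    early = exited_with_error(proc)  # command is still running: looks healthy
    proc.wait()
    late = exited_with_error(proc)   # command has exited: failure is visible
    return early, late


if __name__ == "__main__":
    early, late = demo_race()
    print("failure seen immediately after launch:", early)
    print("failure seen after the command exited:", late)
```

The immediate check reports no failure simply because there is no exit code to read yet, which matches the behaviour seen on the affected machines.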

If the max-master-failures variable is > 1, a second attempt is made to launch the JobTracker.
On similar hardware and configuration, the same timing issue shows up. By the time 2 machines
have failed, the HOD client times out waiting for the JobTracker URL and the cluster is deallocated
by deleting the Torque job.
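A back-of-the-envelope sketch of why so few attempts fit before the client gives up. The function name and the 120-second client timeout are assumptions made purely for illustration; only the "about a minute" lost per undetected failure comes from the observation above.

```python
def attempts_before_timeout(client_timeout_s, seconds_lost_per_attempt,
                            max_master_failures):
    """Number of JobTracker launch attempts that complete before the client
    stops waiting, capped by the configured failure limit."""
    affordable = int(client_timeout_s // seconds_lost_per_attempt)
    return min(affordable, max_master_failures)


if __name__ == "__main__":
    # Hypothetical numbers: 120 s client timeout, failure limit of 5.
    # About 60 s lost per attempt when the early exit goes unnoticed.
    print(attempts_before_timeout(120, 60, 5))  # 2
    # About 1 s per attempt if the failure were detected right away.
    print(attempts_before_timeout(120, 1, 5))   # 5
```

Under these assumed numbers, only two machines are tried before the timeout, regardless of how high the failure limit is set.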

This is a fairly serious issue, because it nullifies the enhancement made in HADOOP-3184,
as the JobTracker is not launched on enough machines to give it a chance of coming up on a
good machine.

Introducing a minor delay of just one second in the HodRing code fixed the problem described
above. It seems fair to wait a bit for the Hadoop command to actually exit (if there are errors)
before checking for its error code.
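The idea of the fix, as a minimal sketch rather than the actual patch: pause briefly after launching the command, then look for a non-zero exit code. `launch_and_check` and `grace_seconds` are hypothetical names, and the example commands are stand-ins for Hadoop commands.

```python
import subprocess
import sys
import time


def launch_and_check(cmd, grace_seconds=1.0):
    """Launch cmd, wait a short grace period, then report an early failure.

    Returns (proc, failed). failed is True only if the command has already
    exited with a non-zero code once the grace period is over.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(grace_seconds)  # give a failing command time to actually exit
    code = proc.poll()         # None means the command is still running
    failed = code is not None and code != 0
    return proc, failed


if __name__ == "__main__":
    # Stand-in for a command that fails a few hundred milliseconds after launch.
    failing = [sys.executable, "-c",
               "import sys, time; time.sleep(0.3); sys.exit(1)"]
    proc, failed = launch_and_check(failing)
    print("early failure detected:", failed)
```

A fixed sleep costs one second on every launch, including healthy ones. That seems a small price compared with the minute lost when the failure goes unnoticed.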

> Hod does not report job tracker failure on hod client side when job tracker fails to come up
> ---------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3531
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3531
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>    Affects Versions: 0.18.0
>            Reporter: Karam Singh
>            Priority: Blocker
>
> Hod does not report job tracker failure on hod client side when job tracker fails to come up.
> When max-master-failure > 1,
> hod client does not properly show why job tracker failed to come up, while in the namenode case a proper error message is displayed.
> Also, on namenode failure the ringmaster log contains information such as: "Detected errors (3) beyond allowed number of failures (2). Flagging error to client",
> while no such information is in the ringmaster log for job tracker failures.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

