hadoop-common-dev mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4586) Fault tolerant Hadoop Job Tracker
Date Wed, 05 Nov 2008 14:55:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12645235#action_12645235 ]

Steve Loughran commented on HADOOP-4586:

This is an interesting project which will provide much thesis work, especially from the testing
and proof of correctness perspectives.

-There are some implicit assumptions about the ability of the infrastructure to provision
hardware, namely that Cold Standby is inappropriate. If a virtual machine can be provisioned
and brought up live within a minute, Cold Standby is surprisingly viable, and, on pay-as-you-go
infrastructure, cost-effective.

-The statement that forwarding all state changes to all slaves -Hot Standby- is best needs
to be qualified with estimated load values and the impact of the events on the network. Is
there a cluster size or map/reduce job lifetime in which the state traffic will become an
issue, or is it just load on the nodes?

-How do you intend to implement failover without notifying the task trackers? DNS update?

-I would like to see some coverage of the election protocol, in particular, how to coordinate
such an election over an infrastructure which denies multicast IP (e.g. Amazon EC2).

-Determining the liveness of the JobTracker is going to be hard. Using Lamport's definitions,
it is only live if it is capable of performing work within bounded time, so the true way to
determine health is to submit work to the system. Some failures (IPC deadlock, host outage,
etc.) may be detectable early, but other failure modes may be hard to detect. Some of the ongoing
work in HADOOP-3628 can act as a starting point, but it is inadequate if you really want "HA".

-Ignoring HDFS availability, what is going to happen when the farm partitions and both partitions
have slaves and a set of task trackers? Who will be in charge?

-I would have expected some citations for an MSc project; presumably this is an early draft.

-Take a look at Anubis; this is how we implement partition awareness/HA, though it currently
uses multicast to bootstrap, so will not work on EC2 without adding a new discovery mechanism.
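
To make the liveness point above concrete: if "live" means "completes work within bounded time", a health check has to submit a small piece of work and wait with a deadline, rather than just ping the process. The following is a minimal sketch of that idea only. The names is_live, submit_probe and hung_probe are hypothetical, and submit_probe stands in for whatever call would submit a trivial job to the JobTracker; none of this is a Hadoop API.

```python
import concurrent.futures
import time


def is_live(submit_probe, timeout_s):
    """Return True only if the probe job completes within timeout_s seconds.

    submit_probe is a hypothetical callable that submits a trivial job and
    blocks until it finishes. It is an assumption, not a Hadoop API.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(submit_probe)
    try:
        future.result(timeout=timeout_s)
        return True
    except concurrent.futures.TimeoutError:
        return False  # no result within the bound: treat as not live
    except Exception:
        return False  # probe failed outright (e.g. connection refused)
    finally:
        # Do not block on a hung probe thread; just stop accepting new work.
        pool.shutdown(wait=False)


def hung_probe():
    # Stands in for a JobTracker that accepts work but does not finish it in time.
    time.sleep(0.5)
```

A probe that returns promptly counts as live; hung_probe with a 0.05 second bound does not. Choosing the bound is the hard part, since a busy but healthy cluster may legitimately take a long time to schedule the probe.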

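On the partition question above, one common answer is a strict-majority quorum: a JobTracker replica may act as master only if it, together with the replicas it can still reach, forms more than half of the configured set. The sketch below illustrates that rule only. It is not how Anubis works and not something the proposal specifies; may_act_as_master and the replica names are hypothetical.

```python
def may_act_as_master(reachable_peers, cluster_size):
    """Strict-majority quorum check for one JobTracker replica.

    reachable_peers: identifiers of the other replicas this node can reach.
    cluster_size: total number of configured replicas, including this node.
    Both parameters are illustrative assumptions.
    """
    visible = len(set(reachable_peers)) + 1  # count this node as well
    # Strict majority: in an even split neither side qualifies.
    return 2 * visible > cluster_size


# With five replicas split 3/2, only the side that sees three may take charge.
majority_side = may_act_as_master({"jt2", "jt3"}, cluster_size=5)
minority_side = may_act_as_master({"jt5"}, cluster_size=5)
```

The cost of this rule is that an even split leaves nobody in charge, which is usually preferable to two masters scheduling the same job.
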
> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>                 Key: HADOOP-4586
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4586
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.18.0
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>         Attachments: FaultTolerantHadoop.pdf
>   Original Estimate: 2016h
>  Remaining Estimate: 2016h
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range
> from managing the job submission process and computing the input splits to
> scheduling the tasks on the slave nodes (TaskTrackers) and monitoring their health.
> In some environments, like the IBM and Google Internet-scale computing
> initiative, there is a need for high availability, and performance
> becomes a secondary issue. In these environments, having a system with
> a Single Point of Failure (such as Hadoop's single JobTracker) is a major
> concern.
> My proposal is to provide a redundant version of Hadoop by adding
> support for multiple replicated JobTrackers. This design can be approached
> in many different ways.
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
