From: "Hari A V (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Wed, 19 Jan 2011 05:14:53 -0500 (EST)
Message-ID: <27432167.56641295432093922.JavaMail.jira@thor>
Subject: [jira] Commented: (MAPREDUCE-225) Fault tolerant Hadoop Job Tracker

    [ https://issues.apache.org/jira/browse/MAPREDUCE-225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12983639#action_12983639 ]

Hari A V commented on MAPREDUCE-225:
------------------------------------

Hi,

My team has also been analysing how to provide HA for the JobTracker. Our approach is quite similar to Francesco's. The complete HA solution can be divided into three aspects:

1. Sharing of job-related state between the master and slave JobTrackers
   This can be achieved with the work in HADOOP-1876 and HADOOP-3245.

2. Failure detection and master election
   We prefer ZooKeeper for this. We had quite a bad experience with JGroups in some of our previous projects, including deadlocks and network traffic overhead (the latest version of JGroups may be more stable), and we were forced to replace it. In our view, ZooKeeper is the best solution available for leader election. We have seen it used successfully in similar situations in the "Katta" project and in some of our internal projects.

3. Notifying JobClients and TaskTrackers about the new master on failure
   One option would be DNS, as mentioned. Another option is to provide a list of JobTracker IPs to the JobClients and TaskTrackers, which can then silently retry against all available IPs in case of failure. On the server side, slave JobTrackers will not accept any service requests. This way we can avoid split-brain and network partition scenarios. A ZooKeeper cluster inherently avoids split-brain issues in leader election.

We have not yet started our work. Please share your opinions.
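
To make the second option in point 3 concrete, here is a minimal sketch of a client that walks a configured list of JobTracker addresses and is only ever served by the current master. It is an illustration under stated assumptions, not Hadoop code: FakeJobTracker, StandbyError, submit_with_failover and the addresses are all hypothetical names, and the master flag is set by hand where a real deployment would take it from the ZooKeeper election.

```python
from dataclasses import dataclass
from typing import Sequence


class StandbyError(Exception):
    """Raised by a standby (slave) JobTracker that refuses service requests."""


@dataclass
class FakeJobTracker:
    """Hypothetical in-memory stand-in for a JobTracker endpoint."""
    address: str
    is_master: bool = False  # assumed to be decided by leader election
    is_up: bool = True

    def submit(self, job_name: str) -> str:
        if not self.is_up:
            raise ConnectionError(f"{self.address} is unreachable")
        if not self.is_master:
            # Standby trackers reject every service request, so a client
            # is never served by a tracker that is not the master.
            raise StandbyError(f"{self.address} is a standby")
        return f"{job_name} accepted by {self.address}"


def submit_with_failover(trackers: Sequence[FakeJobTracker], job_name: str) -> str:
    """Try each configured tracker in order and return the first success."""
    errors = []
    for tracker in trackers:
        try:
            return tracker.submit(job_name)
        except (ConnectionError, StandbyError) as exc:
            errors.append(str(exc))  # remember why this address was skipped
    raise RuntimeError("no active JobTracker found: " + "; ".join(errors))


if __name__ == "__main__":
    trackers = [
        FakeJobTracker("10.0.0.1:9001", is_master=True),
        FakeJobTracker("10.0.0.2:9001"),
    ]
    print(submit_with_failover(trackers, "wordcount"))

    # Simulate a master failure followed by election of the standby.
    trackers[0].is_up = False
    trackers[1].is_master = True
    print(submit_with_failover(trackers, "wordcount"))
```

The retry is silent from the caller's point of view: it only sees an error when no address in the list is both reachable and master, which is the window between a failure and the end of the election.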

Thanks,
Hari

> Fault tolerant Hadoop Job Tracker
> ---------------------------------
>
>                 Key: MAPREDUCE-225
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-225
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>         Environment: High availability enterprise system
>            Reporter: Francesco Salbaroli
>            Assignee: Francesco Salbaroli
>         Attachments: Enhancing the Hadoop MapReduce framework by adding fault.ppt, FaultTolerantHadoop.pdf, HADOOP-4586-0.1.patch, HADOOP-4586v0.3.patch, jgroups-all.jar
>
>
> The Hadoop framework has been designed, in an effort to enhance performance,
> with a single JobTracker (master node). Its responsibilities range from
> managing the job submission process and computing the input splits to
> scheduling tasks on the slave nodes (TaskTrackers) and monitoring their health.
> In some environments, like the IBM and Google's Internet-scale computing
> initiative, there is a need for high availability, and performance becomes a
> secondary issue. In these environments, having a system with a Single Point of
> Failure (such as Hadoop's single JobTracker) is a major concern.
> My proposal is to provide a redundant version of Hadoop by adding support for
> multiple replicated JobTrackers. This design can be approached in many
> different ways.
> In the document at: http://sites.google.com/site/hadoopthesis/Home/FaultTolerantHadoop.pdf?attredirects=0
> I wrote an overview of the problem and some approaches to solve it.
> I post this to the community to gather feedback on the best way to proceed in my work.
> Thank you!

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.