From: "Paul K. Harter, Jr." <Paul.Harter@oracle.com>
To: user@hadoop.apache.org
Subject: Network partitions and Failover Times
Date: Tue, 22 Apr 2014 14:49:13 -0700

I am trying to understand the mechanisms and timing involved when Hadoop is
faced with a network partition. Suppose we have a large Hadoop cluster
configured with automatic failover:

1) Active NameNode
2) Standby NameNode
3) Quorum journal nodes (which we'll ignore for now)
4) ZooKeeper ensemble with 3 nodes

Suppose the ZooKeeper session from the Active NameNode happens to be direct
to the ZK leader node, and that the system experiences a network failure
resulting in 2 partitions (A and B) with the nodes distributed as follows:

A) ZooKeeper leader node;
   Active NameNode

B) 2 ZooKeeper followers;
   Standby NameNode

QUESTIONS:

It seems the result should be that both ZooKeeper and the NameNode fail over
to partition B, but I wanted to confirm the sequence of actions as outlined
below. Does this look right?

If the network failure occurs at time zero, how long should this whole
sequence take if, for example, syncLimit is 5 ticks and the NameNode
sessionTimeout is 10 ticks?
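The timing question can be sketched as back-of-envelope arithmetic. The values
below are assumptions for illustration (tickTime of 2000 ms is ZooKeeper's
common default, and the ~1 s leader-election figure is a guess), not measured
Hadoop/ZooKeeper behavior; only the 5-tick syncLimit and 10-tick
sessionTimeout come from the question above:

```java
// Back-of-envelope failover timing. tickTime and the election estimate are
// illustrative assumptions; syncLimit=5 and sessionTimeout=10 ticks come
// from the scenario above.
public class FailoverEstimate {

    static long ticksToMs(int ticks, long tickTimeMs) {
        return ticks * tickTimeMs;
    }

    public static void main(String[] args) {
        long tickTimeMs = 2000;                            // assumed ZooKeeper tickTime
        long syncLimitMs = ticksToMs(5, tickTimeMs);       // 10 s: leader notices lost quorum
        long sessionTimeoutMs = ticksToMs(10, tickTimeMs); // 20 s: NameNode session expiry
        long electionMs = 1000;                            // guess: leader election on side B

        // The majority side can only expire the session once it has elected a
        // working leader, and the expiry clock runs from the last heartbeat at
        // t=0, so the ephemeral lock is deleted no earlier than the later of
        // the two deadlines.
        long lockDeletedMs = Math.max(syncLimitMs + electionMs, sessionTimeoutMs);
        System.out.println("ephemeral lock deleted after ~" + lockDeletedMs + " ms");
    }
}
```

Under these assumptions the lock would be deleted roughly 20 s after the
partition; the Standby's own transition to Active adds further time on top.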
FAILOVER SEQUENCE (as I understand it):

1) The leader, who ends up in the minority, loses its connection to the
   remaining servers.

2) After syncLimit, the ZK ensemble realizes there's a problem. Normally,
   if a follower loses its connection, it is dropped by the leader and no
   longer participates in voting. In this case, however, the leader no
   longer has quorum, so it has to relinquish leadership. It stops
   responding to client requests, enters the LOOKING state, and starts
   trying to form/join a quorum; (it informs the ZK client library, and)
   all clients are notified with a DISCONNECTED event. (Or is it that the
   DISCONNECTED event is delivered to the client library, which delivers
   connection-loss exceptions to clients?)

   The remaining nodes on the majority side enter leader election and
   choose a new leader (which starts a new epoch) on the majority side.

3) All clients who were connected to the (now former) leader are told to
   reconnect; they will either fail, if they can't reach a node on the new
   majority side, or succeed in connecting to a node in the new quorum.

4) Meanwhile, when the Active NameNode is informed that its server has
   become disconnected (DISCONNECTED event), it must stop responding as the
   Active NameNode. When the ZK quorum reforms and does not get heartbeats
   from the (formerly) Active NameNode, it will eventually (after
   sessionTimeout) declare its session dead. This deletes the ephemeral
   znode being used to hold its lock on "Active" status and triggers the
   watcher for the Standby NameNode. The Standby then competes in the
   Active NameNode election, and should win and become the new Active.
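The lock-and-watch behavior in step 4 can be modeled as a tiny state machine.
This is a toy stand-in, not Hadoop's actual ZKFC/ActiveStandbyElector; the
state and event names are simplified assumptions made up for illustration:

```java
// Toy model of the step-4 failover logic. Not the real ZKFC or
// ActiveStandbyElector: states and events are simplified for illustration.
public class ElectorModel {
    enum State { ACTIVE, STANDBY, NEUTRAL }  // NEUTRAL = fenced, not serving
    enum Event { DISCONNECTED, SESSION_EXPIRED, LOCK_DELETED_WATCH, LOCK_ACQUIRED }

    static State next(State s, Event e) {
        switch (e) {
            case DISCONNECTED:
                // An Active that loses its ZK connection must stop acting
                // Active immediately; a Standby just keeps waiting.
                return s == State.ACTIVE ? State.NEUTRAL : s;
            case SESSION_EXPIRED:
                // The quorum deleted our ephemeral lock znode out from under us.
                return State.STANDBY;
            case LOCK_DELETED_WATCH:
                // The watcher fires on the Standby; it now tries to create the
                // ephemeral lock znode, but stays Standby until create() wins.
                return s;
            case LOCK_ACQUIRED:
                return State.ACTIVE;
            default:
                return s;
        }
    }

    public static void main(String[] args) {
        // Replay the partition scenario for the old Active and the Standby.
        State oldActive = next(State.ACTIVE, Event.DISCONNECTED);       // fenced
        State standby = next(State.STANDBY, Event.LOCK_DELETED_WATCH);  // still Standby
        standby = next(standby, Event.LOCK_ACQUIRED);                   // new Active
        System.out.println(oldActive + " / " + standby);
    }
}
```

The point of the model: the old Active fences itself on DISCONNECTED alone,
before the session even expires, while the Standby only becomes Active after
it actually wins the ephemeral-znode creation race.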
