From: "Paul K. Harter, Jr." <Paul.Harter@oracle.com>
To: user@hadoop.apache.org
Subject: Network partitions and Failover Times
Date: Tue, 22 Apr 2014 14:49:13 -0700

I am trying to understand the mechanisms and timing involved when Hadoop is
faced with a network partition. Suppose we have a large Hadoop cluster
configured with automatic failover:

1) Active NameNode
2) Standby NameNode
3) Quorum journal nodes (which we'll ignore for now)
4) ZooKeeper ensemble with 3 nodes

Suppose the ZooKeeper session from the Active NameNode happens to be direct
to the ZK leader node, and that the system experiences a network failure
resulting in 2 partitions (A and B) with the nodes distributed as follows:

A) ZooKeeper leader node;
   Active NameNode

B) 2 ZooKeeper followers;
   Standby NameNode

QUESTIONS:

It seems the result should be that both ZooKeeper and the NameNode fail over
to partition B, but I wanted to confirm the sequence of actions as outlined
below. Does this look right?

If the network failure occurs at time zero, how long should this whole
sequence take if, for example, syncLimit is 5 ticks and the NameNode
sessionTimeout is 10 ticks?
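The timing question can be sketched as back-of-envelope arithmetic. The values
below are assumptions for illustration (tickTime of 2000 ms is ZooKeeper's
common default, and the ~1 s leader-election figure is a guess), not measured
Hadoop/ZooKeeper behavior; only the 5-tick syncLimit and 10-tick
sessionTimeout come from the question above:

```java
// Back-of-envelope failover timing. tickTime and the election estimate are
// illustrative assumptions; syncLimit=5 and sessionTimeout=10 ticks come
// from the scenario above.
public class FailoverEstimate {

    static long ticksToMs(int ticks, long tickTimeMs) {
        return ticks * tickTimeMs;
    }

    public static void main(String[] args) {
        long tickTimeMs = 2000;                            // assumed ZooKeeper tickTime
        long syncLimitMs = ticksToMs(5, tickTimeMs);       // 10 s: leader notices lost quorum
        long sessionTimeoutMs = ticksToMs(10, tickTimeMs); // 20 s: NameNode session expiry
        long electionMs = 1000;                            // guess: leader election on side B

        // The majority side can only expire the session once it has elected a
        // working leader, and the expiry clock runs from the last heartbeat at
        // t=0, so the ephemeral lock is deleted no earlier than the later of
        // the two deadlines.
        long lockDeletedMs = Math.max(syncLimitMs + electionMs, sessionTimeoutMs);
        System.out.println("ephemeral lock deleted after ~" + lockDeletedMs + " ms");
    }
}
```

Under these assumptions the lock would be deleted roughly 20 s after the
partition; the Standby's own transition to Active adds further time on top.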
FAILOVER SEQUENCE (as I understand it):

1) The leader, who ends up in the minority, loses its connection to the
   remaining servers.

2) After syncLimit, the ZK ensemble realizes there's a problem. Normally,
   if a follower loses its connection, it is dropped by the leader and no
   longer participates in voting. In this case, however, the leader no
   longer has quorum, so it has to relinquish leadership. It stops
   responding to client requests, enters the LOOKING state, and starts
   trying to form/join a quorum; (it informs the ZK client library, and)
   all clients are notified with a DISCONNECTED event. (Or is it that the
   DISCONNECTED event is delivered to the client library, which delivers
   connection-loss exceptions to clients?)

   The remaining nodes on the majority side enter leader election and
   choose a new leader (which starts a new epoch) on the majority side.

3) All clients who were connected to the (now former) leader are told to
   reconnect; they will either fail, if they can't reach a node on the new
   majority side, or succeed in connecting to a node in the new quorum.

4) Meanwhile, when the Active NameNode is informed that its server has
   become disconnected (DISCONNECTED event), it must stop responding as the
   Active NameNode. When the ZK quorum reforms and does not get heartbeats
   from the (formerly) Active NameNode, it will eventually (after
   sessionTimeout) declare its session dead. This deletes the ephemeral
   znode being used to hold its lock on "Active" status and triggers the
   watcher for the Standby NameNode. The Standby then competes in the
   Active NameNode election, and should win and become the new Active.
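The lock-and-watch behavior in step 4 can be modeled as a tiny state machine.
This is a toy stand-in, not Hadoop's actual ZKFC/ActiveStandbyElector; the
state and event names are simplified assumptions made up for illustration:

```java
// Toy model of the step-4 failover logic. Not the real ZKFC or
// ActiveStandbyElector: states and events are simplified for illustration.
public class ElectorModel {
    enum State { ACTIVE, STANDBY, NEUTRAL }  // NEUTRAL = fenced, not serving
    enum Event { DISCONNECTED, SESSION_EXPIRED, LOCK_DELETED_WATCH, LOCK_ACQUIRED }

    static State next(State s, Event e) {
        switch (e) {
            case DISCONNECTED:
                // An Active that loses its ZK connection must stop acting
                // Active immediately; a Standby just keeps waiting.
                return s == State.ACTIVE ? State.NEUTRAL : s;
            case SESSION_EXPIRED:
                // The quorum deleted our ephemeral lock znode out from under us.
                return State.STANDBY;
            case LOCK_DELETED_WATCH:
                // The watcher fires on the Standby; it now tries to create the
                // ephemeral lock znode, but stays Standby until create() wins.
                return s;
            case LOCK_ACQUIRED:
                return State.ACTIVE;
            default:
                return s;
        }
    }

    public static void main(String[] args) {
        // Replay the partition scenario for the old Active and the Standby.
        State oldActive = next(State.ACTIVE, Event.DISCONNECTED);       // fenced
        State standby = next(State.STANDBY, Event.LOCK_DELETED_WATCH);  // still Standby
        standby = next(standby, Event.LOCK_ACQUIRED);                   // new Active
        System.out.println(oldActive + " / " + standby);
    }
}
```

The point of the model: the old Active fences itself on DISCONNECTED alone,
before the session even expires, while the Standby only becomes Active after
it actually wins the ephemeral-znode creation race.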
