Mailing-List: contact user-help@ignite.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@ignite.apache.org
Date: Sat, 19 Aug 2017 17:34:39 -0700 (MST)
From: Biren <biren.shah@servicenow.com>
To: user@ignite.apache.org
Message-ID: <4F13E55E-D6B0-400F-ADC6-9DB99F6AFCF6@servicenow.com>
Subject: Cluster segmentation
MIME-Version: 1.0
Content-Type: multipart/alternative;
	boundary="----=_Part_133811_1753363416.1503189279733"
archived-at: Sun, 20 Aug 2017 00:34:48 -0000

------=_Part_133811_1753363416.1503189279733
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Hi,
I have embedded Ignite into my application and am using it for distributed caches. I am running an Ignite cluster with two nodes in my lab environment. From time to time I get a node segmented event, and the node that receives it dies abruptly.

The application is registered for three discovery events: node_left, node_failed and node_segmented. On receiving any of these events, each node checks whether it is now the oldest node in the cluster, to detect whether the previous oldest node has left or failed.
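
The oldest-node check described above can be sketched in plain Java. This is a minimal model, not Ignite code: `NodeInfo`, `oldest` and `becameOldest` are hypothetical names, and the topology is passed in as a list. It assumes the oldest node is the one with the smallest topology order, which is what `ClusterNode.order()` reports in Ignite; inside a real listener one would more likely call `ignite.cluster().forOldest().node()`.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.UUID;

public class OldestNodeCheck {
    // Hypothetical stand-in for Ignite's ClusterNode: only the node id and the
    // topology order (what ClusterNode.order() returns) are modeled.
    static final class NodeInfo {
        final UUID id;
        final long order;

        NodeInfo(UUID id, long order) {
            this.id = id;
            this.order = order;
        }
    }

    // Assumption: the oldest node is the one with the smallest topology order.
    static Optional<NodeInfo> oldest(List<NodeInfo> topology) {
        return topology.stream().min(Comparator.comparingLong((NodeInfo n) -> n.order));
    }

    // True if the local node is the oldest in the new topology and was not the
    // oldest before the node_left / node_failed / node_segmented event.
    static boolean becameOldest(UUID previousOldest, UUID localId, List<NodeInfo> topologyAfterEvent) {
        return oldest(topologyAfterEvent)
                .map(n -> n.id.equals(localId) && !n.id.equals(previousOldest))
                .orElse(false);
    }

    public static void main(String[] args) {
        UUID app1 = UUID.randomUUID();
        UUID app2 = UUID.randomUUID();
        List<NodeInfo> before = List.of(new NodeInfo(app1, 1), new NodeInfo(app2, 2));
        UUID previousOldest = oldest(before).get().id;

        // Application 1 (the oldest node) fails; only application 2 remains.
        List<NodeInfo> after = List.of(new NodeInfo(app2, 2));
        System.out.println(becameOldest(previousOldest, app2, after)); // prints "true"
    }
}
```

Remembering the previous oldest id lets the listener tell "I just became the oldest" apart from "I was already the oldest", so the takeover logic runs only once.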

I am also listening to lifecycle events, specifically before_node_stop and after_node_stop. On receiving these events, I need to stop another component of the application.
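
A minimal sketch of such a handler in plain Java, assuming the dependent component should be stopped on whichever stop event arrives first and that stopping it twice must be harmless. `LifecycleEvent` is a local stand-in for Ignite's `LifecycleEventType` so the sketch compiles without Ignite; in a real application the same body would live in an implementation of `LifecycleBean.onLifecycleEvent(LifecycleEventType)`. `DependentComponent` and `DependentComponentStopper` are hypothetical names.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class DependentComponentStopper {
    // Local stand-in mirroring the values of Ignite's LifecycleEventType.
    enum LifecycleEvent { BEFORE_NODE_START, AFTER_NODE_START, BEFORE_NODE_STOP, AFTER_NODE_STOP }

    // Hypothetical component that must be shut down exactly once.
    static class DependentComponent {
        private final AtomicBoolean running = new AtomicBoolean(true);
        final AtomicInteger shutdowns = new AtomicInteger();

        // Idempotent: the shutdown work runs only on the first call.
        void stop() {
            if (running.compareAndSet(true, false)) {
                shutdowns.incrementAndGet(); // placeholder for the real shutdown logic
            }
        }

        boolean isRunning() {
            return running.get();
        }
    }

    private final DependentComponent component;

    DependentComponentStopper(DependentComponent component) {
        this.component = component;
    }

    // Stop on either stop event, so the component does not depend on
    // AFTER_NODE_STOP ever being delivered.
    void onLifecycleEvent(LifecycleEvent evt) {
        if (evt == LifecycleEvent.BEFORE_NODE_STOP || evt == LifecycleEvent.AFTER_NODE_STOP) {
            component.stop();
        }
    }

    public static void main(String[] args) {
        DependentComponent component = new DependentComponent();
        DependentComponentStopper stopper = new DependentComponentStopper(component);
        // Only BEFORE_NODE_STOP arrives, as in the application 2 timeline.
        stopper.onLifecycleEvent(LifecycleEvent.BEFORE_NODE_STOP);
        System.out.println(component.isRunning());     // prints "false"
        System.out.println(component.shutdowns.get()); // prints "1"
    }
}
```

The compare-and-set makes a second stop event a no-op, so the component is shut down once whether one or both stop events are delivered.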


  1.  What can cause a node_segmented event?
     *   One obvious reason is a network glitch: the node loses connectivity with the other members.
     *   Can high memory usage or a long GC pause cause segmentation?
     *   Is there a way to find out the cause of a segmentation?
  2.  After getting the node_segmented event, I immediately got a before_node_stop event, but no after_node_stop event followed. The node was left in an inconsistent state and never recovered from it.
     *   Is it possible that trying to get the oldest node in the cluster, while handling the node_segmented event, caused the node to stop?
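
To check whether long GC pauses line up with the node_segmented events, one option is to log JVM stalls from inside the application and compare their timestamps with the event timeline. The sketch below is a generic stdlib-only pause detector, not an Ignite API: a thread sleeps for a fixed interval and reports how much longer than expected the sleep took. It catches any stall of the whole JVM (GC, swapping, etc.), not only GC. The class name, the 50 ms interval and the 500 ms threshold are hypothetical; a natural threshold to compare against is the failure detection timeout configured with `IgniteConfiguration.setFailureDetectionTimeout(...)`.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;
import java.util.function.LongConsumer;

public class JvmPauseDetector implements Runnable {
    private final long intervalMs;      // how often to sample
    private final long thresholdMs;     // report stalls longer than this
    private final LongConsumer onPause; // receives the detected stall in ms
    private volatile boolean stopped = false;

    JvmPauseDetector(long intervalMs, long thresholdMs, LongConsumer onPause) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
        this.onPause = onPause;
    }

    // Extra time (ms) that a sleep of intervalMs actually took; a large value
    // means the whole JVM, including this thread, was stalled.
    static long overshootMs(long startNanos, long endNanos, long intervalMs) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos) - intervalMs;
    }

    @Override
    public void run() {
        while (!stopped) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long pause = overshootMs(start, System.nanoTime(), intervalMs);
            if (pause > thresholdMs) {
                onPause.accept(pause);
            }
        }
    }

    void stop() {
        stopped = true;
    }

    public static void main(String[] args) throws InterruptedException {
        JvmPauseDetector detector = new JvmPauseDetector(50, 500,
                pause -> System.err.println(Instant.now() + " JVM stalled for ~" + pause + " ms"));
        Thread t = new Thread(detector, "jvm-pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(200); // stand-in for the application's real work
        detector.stop();
        t.join();
    }
}
```

A stall reported a few seconds before a node_segmented event would be a hint worth following up with the JVM's own GC logs.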
Event timeline from both nodes:

Application 1:
08/19/17  10:40:10 :  received node failed event. The event was caused by application 2.
08/19/17  10:40:10 :  [10:40:10] Topology snapshot [ver=3, servers=1, clients=0, CPUs=32, heap=14.0GB]

Application 2:
08/19/17 10:40:28 : received node segmented event. The event was caused by application 2
08/19/17 10:40:28 : Checking if oldest has changed
08/19/17 10:40:28 : Ignite Lifecycle event received: BEFORE_NODE_STOP. Fires event to stop another component
08/19/17 10:40:28 : dependent component stops
08/19/17 10:40:28 : received node failed event. The event was caused by application 1.
08/19/17 10:40:10 : Topology snapshot [ver=3, servers=1, clients=0, CPUs=32, heap=14.0GB]

Thanks,
Biren


--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Cluster-segmentation-tp16314.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
------=_Part_133811_1753363416.1503189279733--