helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Old sessions and current sessions not matching on restart
Date Tue, 06 Jan 2015 19:36:23 GMT
The message is not fatal; in fact, it is a check to safeguard against
various things that can go wrong in distributed systems.

Here is the explanation of the ERROR message.

Every time a node starts up, Helix assigns a session id to the participant.
This session id, along with the participant id, serves as the destination id
for any message sent by the controller to a participant. So, in this case,
the controller sent a message to the previous session id, but by the time it
was processed by the participant, the session id had changed. The
participant discards such messages where the ids don't match. This is very
important to ensure that cluster state does not diverge.
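
A minimal sketch of that check, to make the behavior concrete. It is an
illustrative model only: the class and method names below are hypothetical
and are not Helix's own code. It assumes each message carries the session id
it was addressed to and that the participant knows its current session id.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; these types are hypothetical, not Helix classes.
final class SessionFilterSketch {

    // A controller message addressed to one specific participant session.
    static final class Message {
        final String messageId;
        final String tgtSessionId;

        Message(String messageId, String tgtSessionId) {
            this.messageId = messageId;
            this.tgtSessionId = tgtSessionId;
        }
    }

    // Keeps messages addressed to the current session and discards the rest.
    static List<Message> accept(String currentSessionId, List<Message> inbox) {
        List<Message> accepted = new ArrayList<>();
        for (Message m : inbox) {
            if (currentSessionId.equals(m.tgtSessionId)) {
                accepted.add(m);
            } else {
                // Stale message: it targets a session that no longer exists.
                System.out.println("Discarding " + m.messageId
                        + ": expected sessionId " + currentSessionId
                        + ", tgtSessionId in message " + m.tgtSessionId);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<Message> inbox = Arrays.asList(
                new Message("m1", "149a14ada0d0323"),   // sent to the old session
                new Message("m2", "149a14ada0d0324"));  // sent to the new session
        List<Message> accepted = accept("149a14ada0d0324", inbox);
        System.out.println("Accepted messages: " + accepted.size());
    }
}
```

Dropping the stale message rather than applying it is the point: a
transition meant for the old session is never replayed against the new one.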

As I mentioned earlier, these are non-fatal and the cluster should
eventually reach its stable ideal state.

Having said that, the repeated occurrence of this implies something is off
in the system. For example, after the node shuts down, the live instance
ephemeral node on ZooKeeper might continue to linger for some time. This
should not happen in general but can happen when ZK is overloaded, etc.

One way to test this hypothesis would be to wait for the liveinstance to
disappear after disconnecting from the cluster. Ideally, Helix should have
done this (we will file a JIRA). Can you try adding a sleep or a wait of 30
seconds after disconnecting from the cluster?
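
If a fixed sleep feels too blunt, a bounded poll is one alternative. The
sketch below is a generic helper, not a Helix API: waitUntil and the class
name are hypothetical, and the condition you pass in (how you check that the
liveinstance znode is gone, for example with a separate ZooKeeper client) is
an assumption left to the caller.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper; nothing here is part of Helix.
final class LiveInstanceWaitSketch {

    // Polls the condition until it holds or the timeout expires.
    // Returns true if the condition held, false on timeout or interruption.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                Thread.sleep(Math.min(pollMs, remainingMs));
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and stop waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in for "the liveinstance znode no longer exists":
        // here the simulated znode disappears after about 200 ms.
        BooleanSupplier znodeGone = () -> System.currentTimeMillis() - start >= 200;

        boolean gone = waitUntil(znodeGone, 30_000, 50);
        System.out.println(gone ? "liveinstance gone, safe to restart"
                                : "timed out waiting for liveinstance to disappear");
    }
}
```

In a shutdown hook this would run right after helixManager.disconnect(),
with the 30 second figure above used as the timeout rather than as an
unconditional sleep.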

Kishore G

On Tue, Jan 6, 2015 at 10:53 AM, Varun Sharma <varun@pinterest.com> wrote:

> Yeah, I am using helixManager.disconnect(). Is that sufficient to
> close out the zk session?
> Thanks
> Varun
> On Mon, Jan 5, 2015 at 7:02 PM, kishore g <g.kishore@gmail.com> wrote:
>> What code do you have in your shutdown hook? Are you disconnecting
>> gracefully from the cluster and waiting until the liveinstance znode
>> disappears?
>> thanks,
>> Kishore G
>> On Mon, Jan 5, 2015 at 4:11 PM, Varun Sharma <varun@pinterest.com> wrote:
>>> But then the nodes would restart and not have the assigned partitions
>>> since the controller would not write out the messages to open partitions
>>> which should have been on the restarting node ?
>>> On Mon, Jan 5, 2015 at 4:08 PM, kishore g <g.kishore@gmail.com> wrote:
>>>> Try pausing the cluster controller before restarting and unpausing it
>>>> after the restart.
>>>>  On Jan 5, 2015 3:41 PM, "Varun Sharma" <varun@pinterest.com> wrote:
>>>>> Hi,
>>>>> When I do a cluster wide restart, I see the following errors being
>>>>> logged:
>>>>>  2015-01-05 22:08:27,526 [main] (ParticipantManagerHelper.java:234)
>>>>> INFO  *Carrying over old session: 149a14ada0d0323*, resource:
>>>>> $terrapin$data$meta_board_join$1415863274925 to current session:
>>>>> *149a14ada0d0324*
>>>>> This is then followed by a large number of errors:
>>>>> 2015-01-05 22:08:30,321 [main] (HelixTaskExecutor.java:559) WARN
>>>>> SessionId does NOT match. *expected sessionId: 149a14ada0d0324*,
>>>>> tgtSessionId in message: *149a14ada0d0323*, messageId:
>>>>> da2ce3df-b797-4a27-9916-862c27af290a
>>>>> Does this signify a problem? It happens every time I do a restart.
>>>>> Thanks
>>>>> Varun
