helix-user mailing list archives

From kishore g <g.kish...@gmail.com>
Subject Re: Messages building up in helix
Date Mon, 28 Nov 2016 22:06:56 GMT
If you know that the instance will never come back up with the same name, you
can do the following (a sketch follows the list):

- disable the instance
- wait for all partitions hosted by this instance to reach the OFFLINE/DROPPED
state.
- disconnect from the cluster
- use ZKHelixAdmin to drop the instance from the cluster. This should clean
up everything related to the old node.
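
A minimal sketch of those steps, assuming the standard Helix Java admin API
(HelixAdmin/ZKHelixAdmin); the ZooKeeper address, cluster name, and instance
name below are placeholders:

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class DropDeadInstance {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // placeholder
        String cluster = "myCluster";    // placeholder
        String instance = "node_12345";  // placeholder

        // 1. Disable the instance so the controller transitions its
        //    partitions off it.
        admin.enableInstance(cluster, instance, false);

        // 2. (Wait here until the external view shows no partitions on the
        //    instance, and let the participant disconnect; polling elided.)

        // 3. Drop the instance; this cleans up everything related to it.
        admin.dropInstance(cluster, admin.getInstanceConfig(cluster, instance));
        admin.close();
      }
    }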

You can also do this from the controller node: watch for LIVEINSTANCES, and
if nodes are not present under LIVEINSTANCES you can delete those nodes. One
suggestion here: when a node shuts down, have it write a state to its
InstanceConfig, say STATE="SHUTDOWN". Your reaper thread can then look for
nodes in this state and invoke admin.dropInstance (see the sketch below).

dropInstance will take care of cleaning up everything related to a dead
node.
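
A hypothetical sketch of that reaper; note that STATE here is an
application-defined field in the InstanceConfig, not a Helix built-in:

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.model.InstanceConfig;

    public class ShutdownReaper {
      // On clean shutdown a participant would first mark itself, e.g.:
      //   InstanceConfig cfg = admin.getInstanceConfig(cluster, instance);
      //   cfg.getRecord().setSimpleField("STATE", "SHUTDOWN");
      //   admin.setInstanceConfig(cluster, instance, cfg);

      // The reaper then periodically drops anything marked SHUTDOWN.
      public static void reap(HelixAdmin admin, String cluster) {
        for (String instance : admin.getInstancesInCluster(cluster)) {
          InstanceConfig cfg = admin.getInstanceConfig(cluster, instance);
          if ("SHUTDOWN".equals(cfg.getRecord().getSimpleField("STATE"))) {
            admin.dropInstance(cluster, cfg); // cleans up the dead node
          }
        }
      }
    }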




On Mon, Nov 28, 2016 at 1:56 PM, Sesh Jalagam <sjalagam@box.com> wrote:

> Kishore thanks,
>
> Option 1 and Option 3 are plausible. Option 2 is not feasible: even though
> the cluster name is the same, the instance name is different (usually it is
> a random value).
>
> With Option 1, what should I be looking for in the External View? Should I
> be looking at all the resources that should have been transitioned off?
>
> With Option 3, when a cluster is redeployed the controller moves around
> (because of leader election) from old nodes to new nodes, so I wonder if
> the controller will miss any messages for dead nodes. Or I can simply have
> a reaper that comes up and deletes all messages that are destined for
> instances that are not present in /LIVEINSTANCES/.
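>
> (A sketch of such a reaper pass, assuming a connected HelixManager and the
> standard HelixDataAccessor/PropertyKey API:)
>
>     import java.util.List;
>     import org.apache.helix.HelixDataAccessor;
>     import org.apache.helix.HelixManager;
>     import org.apache.helix.PropertyKey;
>
>     public class MessageReaper {
>       static void reapOrphanMessages(HelixManager manager) {
>         HelixDataAccessor accessor = manager.getHelixDataAccessor();
>         PropertyKey.Builder kb = accessor.keyBuilder();
>         List<String> live = accessor.getChildNames(kb.liveInstances());
>         for (String instance : accessor.getChildNames(kb.instanceConfigs())) {
>           if (live.contains(instance)) continue; // skip live instances
>           // delete every message queued for a non-live instance
>           for (String msgId : accessor.getChildNames(kb.messages(instance))) {
>             accessor.removeProperty(kb.message(instance, msgId));
>           }
>         }
>       }
>     }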
>
> How should I be dealing with
> <cluster_id>/INSTANCES/<PARTICIPANT_ID>/CURRENTSTATES? It has stale current
> states (session ids that are no longer valid).
>
>
>
> On Mon, Nov 28, 2016 at 12:52 PM, kishore g <g.kishore@gmail.com> wrote:
>
>> It looks like nodes add and remove themselves quite often. After you
>> disable the instance, Helix sends messages to transition from ONLINE to
>> OFFLINE. It appears the nodes shut down before they get those messages,
>> and when they come back up they use a different instance id.
>>
>> There are two solutions:
>> - During shutdown: after disabling, wait for the state change to be
>> reflected in the External View (see the sketch after this list).
>> - During startup: if possible, re-join the cluster with the same name. If
>> you do that, Helix will remove the old messages.
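>>
>> (A minimal check for the first option, assuming the HelixAdmin external
>> view API; it returns true once the instance no longer appears for any
>> partition:)
>>
>>     import java.util.Map;
>>     import org.apache.helix.HelixAdmin;
>>     import org.apache.helix.model.ExternalView;
>>
>>     public class DrainCheck {
>>       static boolean isDrained(HelixAdmin admin, String cluster, String instance) {
>>         for (String resource : admin.getResourcesInCluster(cluster)) {
>>           ExternalView ev = admin.getResourceExternalView(cluster, resource);
>>           if (ev == null) continue; // no external view yet for this resource
>>           for (String partition : ev.getPartitionSet()) {
>>             Map<String, String> stateMap = ev.getStateMap(partition);
>>             if (stateMap != null && stateMap.containsKey(instance)) {
>>               return false; // instance still hosts this partition
>>             }
>>           }
>>         }
>>         return true;
>>       }
>>     }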
>>
>> A third option is to support autoCleanUp in Helix. Helix controller can
>> monitor the cluster for dead nodes and remove them automatically after some
>> time.
>>
>>
>>
>> On Mon, Nov 28, 2016 at 12:39 PM, Sesh Jalagam <sjalagam@box.com> wrote:
>>
>>> <clustername>/INSTANCES/INSTANCES/MESSAGES contains messages that have
>>> already been read.
>>>
>>> Here is an example.
>>>     ,"FROM_STATE":"ONLINE"
>>>     ,"MSG_STATE":"read"
>>>     ,"MSG_TYPE":"STATE_TRANSITION"
>>>     ,"STATE_MODEL_DEF":"OnlineOffline"
>>>     ,"STATE_MODEL_FACTORY_NAME":"DEFAULT"
>>>     ,"TO_STATE":"OFFLINE"
>>>
>>> I see these messages after the participant is disabled and dropped, i.e.
>>> <clustername>/INSTANCES/<PARTICIPANT_ID> is removed.
>>>
>>> Thanks
>>>
>>>
>>> On Mon, Nov 28, 2016 at 12:18 PM, kishore g <g.kishore@gmail.com> wrote:
>>>
>>>> By <clustername>/INSTANCES/INSTANCES/MESSAGES, do you mean
>>>> <clustername>/INSTANCES/<PARTICIPANT_ID>/MESSAGES?
>>>>
>>>> What kind of messages do you see under these nodes?
>>>>
>>>>
>>>>
>>>> On Mon, Nov 28, 2016 at 12:04 PM, Sesh Jalagam <sjalagam@box.com>
>>>> wrote:
>>>>
>>>>> Our setup is as follows.
>>>>>
>>>>> - Controller (leader elected from one of the cluster nodes)
>>>>>
>>>>> - Cluster of nodes as participants in OnlineOffline StateModel
>>>>>
>>>>> - Set of resources with partitions.
>>>>>
>>>>>
>>>>> Each node, on startup, creates a controller, adds a participant if one
>>>>> does not already exist, and waits for the callbacks to handle partition
>>>>> rebalancing (sketch below).
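>>>>>
>>>>> (A sketch of that startup flow, assuming the standard Helix participant
>>>>> API; the cluster/instance names are placeholders and the state model
>>>>> factory is app-defined:)
>>>>>
>>>>>     import org.apache.helix.HelixManager;
>>>>>     import org.apache.helix.HelixManagerFactory;
>>>>>     import org.apache.helix.InstanceType;
>>>>>
>>>>>     public class NodeStartup {
>>>>>       public static void main(String[] args) throws Exception {
>>>>>         HelixManager participant = HelixManagerFactory.getZKHelixManager(
>>>>>             "myCluster", "node_12345", InstanceType.PARTICIPANT,
>>>>>             "localhost:2181");
>>>>>         // register the app's OnlineOffline factory before connecting:
>>>>>         // participant.getStateMachineEngine().registerStateModelFactory(
>>>>>         //     "OnlineOffline", new MyOnlineOfflineFactory());
>>>>>         participant.connect(); // joins the cluster; callbacks rebalance
>>>>>       }
>>>>>     }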
>>>>>
>>>>> Please note this cluster is created on the fly multiple times a day
>>>>> (the actual cluster is not deleted, but participants are removed and
>>>>> re-added).
>>>>>
>>>>>
>>>>> Everything works fine in production, but I see that the znodes
>>>>> in <clustername>/INSTANCES/INSTANCES/MESSAGES are growing.
>>>>>
>>>>> What is <cluster_id>/INSTANCES/INSTANCES used for? Is there a way for
>>>>> the messages to be deleted automatically?
>>>>>
>>>>> I see a similar buildup in <cluster_id>/INSTANCES/INSTANCES/CURRENTSTATES.
>>>>>
>>>>>
>>>>> Thanks
>>>>> --
>>>>> - Sesh .J
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> - Sesh .J
>>>
>>
>>
>
>
> --
> - Sesh .J
>
