ignite-user mailing list archives

From Ilya Kasnacheev <ilya.kasnach...@gmail.com>
Subject Re: Issue with recovery of Kubernetes deployed server node after abnormal shutdown
Date Wed, 04 Sep 2019 13:59:50 GMT
Hello!

I think the mere presence of a lock file is not enough to cause this. The lock
can be reacquired if the previous node is down. Maybe it can't be reacquired
due to, e.g., file permission issues?
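
To illustrate the distinction between a lock file existing and a lock being held, here is a minimal sketch. It assumes the lock is an OS-level file lock of the kind java.nio exposes through FileChannel.tryLock(); it is not Ignite's own code, and the class name, the canAcquire helper and the "node00"/"lock" names are made up for the example. It creates a leftover file, as a killed process would leave behind, and then checks whether the lock itself can still be taken.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class StaleLockDemo {
    /**
     * Hypothetical helper: returns true if an exclusive OS-level lock on the
     * file can be taken right now. The lock is released again before returning.
     */
    static boolean canAcquire(Path lockFile) throws IOException {
        try (FileChannel ch = FileChannel.open(
                lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            // tryLock() returns null when another process holds the lock.
            FileLock lock = ch.tryLock();
            if (lock == null) {
                return false;
            }
            lock.release();
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate a persistence folder with a lock file left by a dead process.
        Path dir = Files.createTempDirectory("node00");
        Path lockFile = dir.resolve("lock");
        Files.createFile(lockFile);

        System.out.println("lock file exists: " + Files.exists(lockFile));
        System.out.println("lock acquirable:  " + canAcquire(lockFile));
    }
}
```

On a local filesystem the leftover file should not prevent the lock from being taken, because no live process holds it. Lock behaviour on network filesystems can differ and is not covered by this sketch.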

Setting consistentId will prevent the node from starting at all if it can't
lock the storage.
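
For reference, a minimal sketch of setting consistentId explicitly on a node with native persistence enabled. It needs the Ignite core library on the classpath. The ID value "ignite-node-0" is a placeholder chosen for the example, and how each pod obtains a stable, unique ID is left open here.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ExplicitConsistentIdNode {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Placeholder ID: it must be unique per node and must stay the same
        // across restarts of the node that owns a given persistence folder.
        cfg.setConsistentId("ignite-node-0");

        // Enable native persistence on the default data region.
        DataStorageConfiguration storage = new DataStorageConfiguration();
        storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storage);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Started node with consistentId: "
            + ignite.cluster().localNode().consistentId());
    }
}
```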

Regards,
-- 
Ilya Kasnacheev


On Sat, 31 Aug 2019 at 08:52, Raymond Wilson <raymond_wilson@trimble.com> wrote:

> Hi Ilya,
>
> It is curious you do not see the lock failure error.
>
> Currently our approach is that the Kubernetes nodes (pods) are stateless
> and are provisioned against the EFS volume at the point they are created.
> In this way the consistent id as such is a part of the persistent store and
> is inherited by the Kubernetes pod when it attaches to the persistent
> volume.
> In general this works really well, except for this instance related to the
> lock file being left after abnormal node termination.
>
> The particular issue seems to occur due to the presence of the lock file
> at the point the ignite node in the Kubernetes pod tries to access the
> persistent store. I.e., the new pod sees the lock file and determines this
> persistent volume is not available for the new pod to access, so it creates
> a new node.
>
> We are happy to modify our approach to align with IA best practices. Does
> assigning consistent IDs manually, rather than using the default consistent
> ID, mean that the lock file being present does not cause an issue? How
> would we align consistent ID specification with Kubernetes automatic pod
> replacement on IA node failure?
>
> Thanks,
> Raymond.
>
>
> On Sat, Aug 31, 2019 at 2:24 AM Ilya Kasnacheev <ilya.kasnacheev@gmail.com>
> wrote:
>
>> Hello!
>>
>> Maybe I misunderstand something, but my recommendation would be to provide
>> a consistentId for all nodes. This way, it would be impossible to boot with
>> a wrong/different data dir.
>>
>> It's not obvious why the error "Unable to acquire lock" happens; I didn't
>> see that. What's your target OS? Are you sure all other instances are
>> completely stopped at the time of this node startup?
>>
>> Regards,
>> --
>> Ilya Kasnacheev
>>
>>
>> On Wed, 28 Aug 2019 at 06:50, Raymond Wilson <raymond_wilson@trimble.com> wrote:
>>
>>> We have an Ignite grid deployed on a Kubernetes cluster using an AWS EFS
>>> volume to store the persistent data for all nodes in the grid.
>>>
>>> The Ignite-based services running on those pods respond to SIGTERM-style
>>> graceful shutdown and restart events by reattaching to the persistent
>>> stores in the EFS volume.
>>>
>>> Ignite maintains a lock file in the persistence folder for each node
>>> that indicates if that persistence store is owned by a running Ignite
>>> server node. When the node shuts down gracefully the lock file is removed,
>>> allowing a new Ignite node in a Kubernetes pod to use it.
>>>
>>> If an Ignite server node hosted in a Kubernetes pod is subject to
>>> abnormal termination (e.g. via SIGKILL or a failure in the underlying EC2
>>> server hosting the K8s pod), then the lock file is not removed. When a new
>>> K8s pod starts up to replace the one that failed, it does not reattach to
>>> the existing node persistence folder due to the lock file. Instead it
>>> creates another node persistence folder which leads to apparent data loss.
>>>
>>> This can be seen in the log fragment below where a new pod examines the
>>> node00 folder, finds a lock file and proceeds to create a node01 folder due
>>> to that lock.
>>>
>>> [image: image.png]
>>>
>>> My question is: What is the best way to manage this so that abnormal
>>> termination recovery copes with the orphaned lock file without the need for
>>> DevOps intervention?
>>>
>>> Thanks,
>>> Raymond.
>>>
>>>
