curator-dev mailing list archives

From "Cameron McKenzie (JIRA)" <>
Subject [jira] [Commented] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state
Date Wed, 29 May 2019 02:12:00 GMT


Cameron McKenzie commented on CURATOR-525:

I have reproduced the problem with some debugging shenanigans. While I believe the suggested
fix will work, I wonder if we should also look at the way guaranteed deletes work. Maybe queue
them internally when there's connection loss instead of just retrying indefinitely while the
connection is not there?
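
To make the queueing idea concrete, here is a minimal sketch of what "queue them internally" could look like. It is a toy model, not Curator code: the class QueuedDeleteSketch and all of its methods are hypothetical names invented for illustration, and the "delete" is just a list append standing in for a real ZooKeeper call.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical toy model: none of these names exist in Curator.
class QueuedDeleteSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> deleted = new ArrayList<>();
    private boolean connected = true;

    // While connected, "delete" right away; otherwise park the path
    // instead of retrying in a loop against a dead connection.
    synchronized void guaranteedDelete(String path) {
        if (connected) {
            deleted.add(path); // stand-in for the real delete call
        } else {
            pending.add(path);
        }
    }

    synchronized void onConnectionLost() {
        connected = false;
    }

    // On reconnect, drain the parked deletes in the order they were requested.
    synchronized void onReconnected() {
        connected = true;
        String path;
        while ((path = pending.poll()) != null) {
            deleted.add(path);
        }
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized List<String> deletedPaths() {
        return new ArrayList<>(deleted);
    }
}
```

With this shape, no background operation runs while the connection is down, so nothing on the delete path would be in a position to call validateConnection() during a reconnect. Whether that fits Curator's actual retry machinery is an open question.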

> There is a race condition in Curator which might lead to fake SUSPENDED event and ruin
CuratorFrameworkImpl inner state 
> ------------------------------------------------------------------------------------------------------------------------
>                 Key: CURATOR-525
>                 URL:
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>            Reporter: Mikhail Valiev
>            Assignee: Cameron McKenzie
>            Priority: Critical
>         Attachments:, background-thread-infinite-loop.png,
curator-race-condition.png, event-watcher-thread.png
> This was originally found in the 2.11.1 version of Curator, but I tested the latest
release as well, and the issue is still there.
> The issue is tied to guaranteed deletes, which loop infinitely if called when there
is no connection:
> client.delete().guaranteed().forPath(ourPath); 
> This schedules a background operation that attempts to remove the node in an infinite
loop. Each time a background operation fails due to connection loss, it performs a check (the validateConnection()
function) to see whether the main thread is already aware of the connection loss and, if it is not,
raises the connection loss event. The problem is that this piece of code is also executed
by the event watcher thread when connection events occur, which leads to a race condition.
When the connection is restored, the main thread can easily raise the RECONNECTED
event first, with the background thread raising a SUSPENDED event after it.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator's inner connection
state and leads to Curator behaving unpredictably.
> I have attached some illustrations and a unit test to reproduce the issue. (Put a breakpoint in
validateConnection().)
> *Possible solution*: in the CuratorFrameworkImpl class, adjust the processEvent() function
by adding the following:
> if (event.getType() == CuratorEventType.SYNC) {
>     connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is the same state as before, it will be ignored. If a background operation succeeded
while we are in the SUSPENDED state, this would repair the Curator state and raise RECONNECTED.
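
The interleaving described in the report, and the effect of the proposed repair, can be replayed deterministically in a toy model. This is only a sketch under stated assumptions: ConnState, ToyStateManager and PhantomSuspendedDemo are hypothetical names that mimic the report's vocabulary and are not Curator classes. The race is written out as a fixed sequence of calls rather than run on real threads, and the repair step is modeled as "a background operation completed successfully", whereas the report's snippet keys on a SYNC event.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: these types are NOT Curator classes.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED }

class ToyStateManager {
    private ConnState current = ConnState.CONNECTED;
    final List<ConnState> raised = new ArrayList<>();

    // As described in the report: a change to the state we are already in is ignored.
    synchronized boolean addStateChange(ConnState next) {
        if (next == current) {
            return false;
        }
        current = next;
        raised.add(next);
        return true;
    }

    synchronized ConnState current() {
        return current;
    }
}

class PhantomSuspendedDemo {
    // Replays one unlucky ordering of the two threads as a fixed sequence.
    static ToyStateManager replay(boolean applyProposedFix) {
        ToyStateManager manager = new ToyStateManager();

        // The connection really drops.
        manager.addStateChange(ConnState.SUSPENDED);
        // Background thread: a delete attempt fails with connection loss and
        // the thread decides to raise SUSPENDED, but has not done so yet.
        // Watcher thread: the connection comes back.
        manager.addStateChange(ConnState.RECONNECTED);
        // Background thread: acts on its stale decision. This is the phantom event.
        manager.addStateChange(ConnState.SUSPENDED);

        // Later, a background operation succeeds because the connection is up.
        if (applyProposedFix) {
            // Assumed repair step: a success implies we are connected.
            manager.addStateChange(ConnState.RECONNECTED);
        }
        return manager;
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + replay(false).raised);
        System.out.println("with fix:    " + replay(true).raised);
    }
}
```

Without the repair step, the model ends in SUSPENDED even though the connection is up. With it, the final state is RECONNECTED, and a repeated RECONNECTED in the healthy case would be ignored by addStateChange.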

This message was sent by Atlassian JIRA
