curator-dev mailing list archives

From "Cameron McKenzie (JIRA)" <>
Subject [jira] [Commented] (CURATOR-308) SimpleDistributedQueue::take() hangs if container nodes are removed
Date Tue, 15 Mar 2016 22:33:33 GMT


Cameron McKenzie commented on CURATOR-308:

Actually, this is a bit more complicated.

[~randgalt]: SimpleDistributedQueue uses EnsureContainers on the root path. This is
problematic because once the last piece of work in the queue is removed, the root path
(a container node) gets removed too, and it won't be recreated until another piece of work
is added. This exposes a race condition in take(): if the root node doesn't exist, take()
will never return. It blocks waiting for children to appear under a path that doesn't
exist, so the watch never fires.

So, I guess there are two options to fix this:
- Give EnsureContainers an option to do the ensure via a persistent node
  rather than a container node.
- Modify take() so that it can handle the root node disappearing.

I think it's probably best to do both. Currently, if the root node disappears for any reason,
take() will block forever, which isn't ideal. It should instead set a checkExists()
watcher and wait for the node to come back.
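
To make the second option concrete, here is a rough sketch of the idea. It does not use
ZooKeeper or Curator at all: FakeRoot, tryTake, offer, take and demo are hypothetical names
for a small in-memory model with one-shot watches, and the root starts out missing, as it
would after container cleanup. The only behaviour it borrows from the real system is that a
children watch cannot be set on a node that doesn't exist, whereas an exists-style watch can.
It is not the actual SimpleDistributedQueue code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class RootWatchSketch {

    /** Hypothetical in-memory stand-in for the queue root; not ZooKeeper or Curator. */
    static final class FakeRoot {
        private boolean exists = false; // starts missing, as after container cleanup
        private final Deque<String> children = new ArrayDeque<>();
        private final List<Runnable> watches = new ArrayList<>();

        /** Removes and returns the first child, or returns null and possibly sets a watch. */
        synchronized String tryTake(Runnable watcher, boolean watchMissingRoot) {
            if (exists && !children.isEmpty()) {
                return children.poll();
            }
            // A children watch can only be set on an existing node. On a missing
            // root, a watch is set only when the exists-style fallback is enabled.
            if (exists || watchMissingRoot) {
                watches.add(watcher);
            }
            return null;
        }

        /** Recreates the root if needed, adds a child, and fires the pending watches. */
        synchronized void offer(String item) {
            exists = true;
            children.add(item);
            List<Runnable> toFire = new ArrayList<>(watches);
            watches.clear(); // watches are one-shot
            toFire.forEach(Runnable::run);
        }
    }

    /** Blocks until an item arrives or timeoutMs elapses; returns null on timeout. */
    static String take(FakeRoot root, boolean watchMissingRoot, long timeoutMs,
                       Runnable beforeWait) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (true) {
            CountDownLatch latch = new CountDownLatch(1);
            String item = root.tryTake(latch::countDown, watchMissingRoot);
            if (item != null) {
                return item;
            }
            beforeWait.run(); // lets the demo add work only once we are about to block
            long remaining = deadline - System.nanoTime();
            if (remaining <= 0 || !latch.await(remaining, TimeUnit.NANOSECONDS)) {
                return null; // no watch fired before the deadline
            }
        }
    }

    /** One consumer and one producer against a root that starts out missing. */
    static String demo(boolean watchMissingRoot) throws InterruptedException {
        FakeRoot root = new FakeRoot();
        CountDownLatch consumerBlocked = new CountDownLatch(1);
        Thread producer = new Thread(() -> {
            try {
                consumerBlocked.await();
                root.offer("job-1");
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        String result = take(root, watchMissingRoot, 300, consumerBlocked::countDown);
        producer.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without fallback: " + demo(false));
        System.out.println("with fallback:    " + demo(true));
    }
}
```

Without the fallback, the consumer waits out the full timeout even though work was added
in the meantime, which matches the poll() behaviour described in the issue. With the
fallback, the exists-style watch fires as soon as the root comes back and the item is
returned. In real code the fallback would presumably be a checkExists() call with a
watcher on the root path, followed by another pass through the loop.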


> SimpleDistributedQueue::take() hangs if container nodes are removed
> -------------------------------------------------------------------
>                 Key: CURATOR-308
>                 URL:
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 3.1.0
>         Environment: org.apache.curator:curator-recipes 3.1.0
> org.apache.curator:curator-test 3.1.0
>            Reporter: Philip Searle
>         Attachments:
> SimpleDistributedQueue creates the queue using container nodes if the ZooKeeper instance
> supports this feature. If ZooKeeper runs the container node cleanup task while
> SimpleDistributedQueue::take() is blocking, the call will not ever return.
> A similar issue occurs when calling poll(), resulting in it delaying until the timeout
> has elapsed, even if a queue item was inserted after the container cleanup occurs.
