zookeeper-user mailing list archives

From Flavio Junqueira <...@yahoo-inc.com>
Subject Re: Question about the Barrier Java example on the ZooKeeper documentation
Date Tue, 08 Mar 2011 13:59:14 GMT
I believe the goal of the examples was never to be complete
solutions to barriers or queues, but just to give a quick bootstrap to
beginners. It is true, though, that the documentation page does not
say so, and it can be misleading.

I see two possible action points out of this discussion:
1- State clearly at the beginning that the example discussed is not
correct under the assumption that a process may finish the computation
before another has started, and that the example is there for
illustration.
2- Have another example following the current one that discusses the
problem and shows how to fix it. This is an interesting option that
illustrates how one could reason about a solution when developing with
ZooKeeper.

If you are interested in helping us fix it, Semih, then you could  
perhaps create a jira and assign yourself to fix it. I can help you out.


On Mar 7, 2011, at 11:23 AM, Semih Salihoglu wrote:

> Hi Mahadev,
> Sorry for the late response. I agree. Actually, in this other
> documentation,
> http://hadoop.apache.org/zookeeper/docs/r3.0.0/recipes.html, where
> there is only the pseudo-code, I think this situation is avoided. Here
> there is another znode, /ready, that all nodes have a watch on. After
> each node writes its own ephemeral child, they don't wait on the
> children. They read how many have been written, and the last one
> writes the /ready znode and everyone wakes up.
> The only race condition in this one is that there can be two nodes
> trying to write /ready and only one of them will succeed, but this is
> ok.
> Thank you again,
> semih
> On Sat, Mar 5, 2011 at 6:41 PM, Mahadev Konar <mahadev@apache.org> wrote:
>> Semih,
>> You pointed it out right. It is possible to enter into a situation
>> like that. The recipe does have a bug. It can be fixed with the last
>> client creating a special znode and every node in the list watching
>> for that (so it'll be an indication for entering the barrier). No?
>> thanks
>> mahadev
>> On Sat, Mar 5, 2011 at 5:06 PM, Semih Salihoglu <semih@stanford.edu> wrote:
>>> Hi All,
>>> I am new to this group and to ZooKeeper. I was reading the Barrier
>>> tutorial in one of the ZooKeeper documentation pages:
>>> http://hadoop.apache.org/zookeeper/docs/current/zookeeperTutorial.html
>>> A barrier primitive is exactly how I want to use ZooKeeper. I have a
>>> question about this example. It's not really a ZooKeeper question;
>>> it's more a question about the Barrier primitive, I think. Here it
>>> is: in the enter method of this Barrier implementation below
>>> boolean enter() throws KeeperException, InterruptedException{
>>>           zk.create(root + "/" + name, new byte[0],  
>>>                   CreateMode.EPHEMERAL_SEQUENTIAL);
>>>           while (true) {
>>>               synchronized (mutex) {
>>>                   List<String> list = zk.getChildren(root, true);
>>>                   if (list.size() < size) {
>>>                       mutex.wait();
>>>                   } else {
>>>                       return true;
>>>                   }
>>>               }
>>>           }
>>>       }
>>> could there be a race condition? Let's say there are two
>>> machines/nodes, node1 and node2, that will use this code to
>>> synchronize over ZK. Let's say the following steps take place:
>>>  1. node1 calls the zk.create method and then reads the number of
>>> children, sees that it's 1, and starts waiting.
>>>  2. node2 calls the zk.create method (but doesn't call the
>>> zk.getChildren method yet; let's say it's very slow).
>>>  3. node1 is notified that the number of children on the znode
>>> changed. It checks that the size is 2, so it returns from enter(),
>>> does its work, and then leaves the barrier, deleting its node.
>>>  4. node2 calls zk.getChildren and, because node1 has already left,
>>> it sees that the number of children is equal to 1. Since node1 will
>>> never enter the barrier again, node2 will keep waiting.
>>> Could this scenario happen? If not, what is preventing this? I
>>> haven't copied the code piece that enters the barrier, does work, and
>>> leaves the barrier, but in the link I pasted above, it's the
>>> barrierTest(String args[]) method.
>>> Thank you very much in advance,
>>> semih
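
To make the schedule in steps 1 to 4 concrete, here is a minimal sketch
that replays it in memory, once with the tutorial's child-count check and
once with the /ready znode variant described earlier in the thread. It
does not use the ZooKeeper client API: FakeZk, BarrierRaceSketch, and
both method names are hypothetical stand-ins invented for illustration,
and the interleaving is hard-coded rather than produced by real threads
or watches.

```java
import java.util.HashSet;
import java.util.Set;

// In-memory replay of the schedule from the thread. FakeZk is a hypothetical
// stand-in for the znode tree, NOT the real ZooKeeper client API.
class BarrierRaceSketch {

    static final class FakeZk {
        final Set<String> children = new HashSet<>(); // ephemeral children of the barrier root
        boolean ready = false;                        // models whether /ready exists
    }

    // Tutorial-style barrier: a node enters once it sees `size` children.
    // Returns true if node2 manages to enter under the problematic schedule.
    static boolean node2EntersWithChildCount(int size) {
        FakeZk zk = new FakeZk();
        zk.children.add("node1");                          // step 1: node1 creates its child,
        boolean node1Entered = zk.children.size() >= size; // sees 1 child, and waits
        zk.children.add("node2");                          // step 2: node2 creates, has not read yet
        node1Entered = zk.children.size() >= size;         // step 3: node1 is notified and sees 2
        if (node1Entered) {
            zk.children.remove("node1");                   // node1 works, leaves, deletes its child
        }
        return zk.children.size() >= size;                 // step 4: node2 finally reads and sees 1
    }

    // /ready variant: nodes wait on /ready; whoever sees `size` children creates it.
    // Returns true if node2 manages to enter under the same schedule.
    static boolean node2EntersWithReadyNode(int size) {
        FakeZk zk = new FakeZk();
        zk.children.add("node1");                          // step 1: node1 creates its child
        if (zk.children.size() >= size) {
            zk.ready = true;                               // not taken: only 1 child so far
        }
        zk.children.add("node2");                          // step 2: node2 creates, has not read yet
        // Step 3 does not wake node1: it watches /ready, not the child list,
        // so it cannot leave before node2 has made its own check.
        if (zk.children.size() >= size) {
            zk.ready = true;                               // step 4: node2 sees 2 and creates /ready
        }
        return zk.ready;                                   // both nodes may now enter
    }

    public static void main(String[] args) {
        System.out.println("child-count barrier, node2 enters: " + node2EntersWithChildCount(2));
        System.out.println("/ready barrier, node2 enters: " + node2EntersWithReadyNode(2));
    }
}
```

With size 2, the first replay leaves node2 waiting and the second lets it
through. In a real client the /ready creation can race between two nodes,
and, as noted in the thread, only one create will succeed, which is
harmless.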


research scientist

direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301
