From: Ted Dunning
Date: Mon, 31 May 2010 13:54:37 -0700
Subject: Re: Locking and Partial Failure
To: zookeeper-user@hadoop.apache.org
Cc: Charles Gordon

Isn't this a special case of
https://issues.apache.org/jira/browse/ZOOKEEPER-22 ?

Is there any progress on this?

On Mon, May 31, 2010 at 12:34 PM, Patrick Hunt wrote:

> Hi Charles, any luck with this? Re the issues you found with the recipes,
> please enter a JIRA; it would be good to address the problem(s) you found.
> https://issues.apache.org/jira/browse/ZOOKEEPER
>
> Re the use of session/thread id: might you use some sort of unique token
> that's dynamically assigned to the thread making a request on the shared
> session? The calling code could then be identified by that token in
> recovery cases.
>
> Patrick
>
> On 05/28/2010 08:28 AM, Charles Gordon wrote:
>
>> Hello,
>>
>> I am new to using ZooKeeper and I have a quick question about the
>> locking recipe that can be found here:
>>
>> http://hadoop.apache.org/zookeeper/docs/r3.1.2/recipes.html#sc_recipes_Locks
>>
>> It appears to me that there is a flaw in this algorithm related to
>> partial failure, and I am curious to know how to fix it.
>>
>> The algorithm follows these steps (a rough Java sketch follows the
>> list):
>>
>> 1. Call "create()" with a pathname like
>>    "/some/path/to/parent/child-lock-".
>> 2. Call "getChildren()" on the lock node without the watch flag set.
>> 3. If the path created in step (1) has the lowest sequence number, you
>>    are the master (skip the next steps).
>> 4. Otherwise, call "exists()" with the watch flag set on the child with
>>    the next lowest sequence number.
>> 5. If "exists()" returns false, go to step (2); otherwise, wait for a
>>    notification from the path, then go to step (2).
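>>
>> For concreteness, here is a minimal sketch of that loop in Java. The
>> class and variable names are hypothetical, error handling is elided,
>> and I am assuming the children share a fixed prefix so lexicographic
>> order matches sequence order:
>>
>>     import java.util.Collections;
>>     import java.util.List;
>>     import org.apache.zookeeper.*;
>>     import org.apache.zookeeper.data.Stat;
>>
>>     public class LockSketch {
>>         // Steps (1)-(5) of the recipe; not the shipped recipe code.
>>         public static String acquire(ZooKeeper zk, String dir)
>>                 throws Exception {
>>             // Step 1: create a sequential ephemeral child.
>>             String path = zk.create(dir + "/child-lock-", new byte[0],
>>                     ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                     CreateMode.EPHEMERAL_SEQUENTIAL);
>>             String me = path.substring(path.lastIndexOf('/') + 1);
>>             while (true) {
>>                 // Step 2: list children without setting a watch.
>>                 List<String> kids = zk.getChildren(dir, false);
>>                 Collections.sort(kids);
>>                 int i = kids.indexOf(me);
>>                 // Step 3: lowest sequence number -> we hold the lock.
>>                 if (i == 0) {
>>                     return path;
>>                 }
>>                 // Step 4: watch only the next-lowest child.
>>                 final boolean[] fired = { false };
>>                 Stat s = zk.exists(dir + "/" + kids.get(i - 1),
>>                         new Watcher() {
>>                             public void process(WatchedEvent e) {
>>                                 synchronized (fired) {
>>                                     fired[0] = true;
>>                                     fired.notifyAll();
>>                                 }
>>                             }
>>                         });
>>                 // Step 5: if that child is already gone, re-check;
>>                 // otherwise wait for the notification, then re-check.
>>                 if (s != null) {
>>                     synchronized (fired) {
>>                         while (!fired[0]) fired.wait();
>>                     }
>>                 }
>>             }
>>         }
>>     }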
>>
>> The scenario that seems to be faulty is a partial failure in step (1).
>> Assume that my client program follows step (1) and calls "create()".
>> Assume that the call succeeds on the ZooKeeper server, but there is a
>> ConnectionLoss event right as the server sends the response (e.g., a
>> network partition, some dropped packets, the ZK server goes down, etc.).
>> Assume further that the client immediately reconnects, so the session is
>> not timed out. At this point there is a child node that was created by
>> my client, but that my client does not know about (since it never
>> received the response). Since my client doesn't know about the child, it
>> won't know to watch the child before it, and it also won't know to
>> delete it. That means all clients using that lock will fail to make
>> progress as soon as the orphaned child has the lowest sequence number.
>> This state will continue until my client closes its session (which may
>> be a while, since I would like to have a long-lived session).
>> Correctness is maintained here, but liveness is not.
>>
>> The only good solution I have found for this problem is to establish a
>> new session with ZooKeeper before acquiring a lock, and to close that
>> session immediately upon any connection loss in step (1). If everything
>> works, the session could be re-used, but you'd need to guarantee that
>> the session was closed if there was a failure during creation of the
>> child node. Are there other good solutions?
>>
>> I looked at the sample code that comes with the ZooKeeper distribution
>> (I'm using 3.2.2 right now), and it uses the current session ID as part
>> of the child node name. Then, if there is a failure during creation, it
>> tries to look up the child using that session ID. This isn't really
>> helpful in the environment I'm using, where a single session can be
>> shared by multiple threads, any of which could request a lock (so I
>> can't uniquely identify a lock by session ID). I could use the thread
>> ID, but then I run the risk of a thread being reused and getting the
>> wrong lock. In any case, there is also the risk that a second failure
>> prevents me from looking up the lock after a connection loss, so I'm
>> right back to an orphaned lock child, as above. I could, presumably, be
>> careful enough with try/catch logic to prevent even that case, but it
>> makes for pretty bug-prone code; a sketch of what I mean follows.
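>>
>> To make the lookup-after-failure idea concrete, here is a hypothetical
>> sketch that tags each request with a unique token instead of the
>> session ID, so it also works when several threads share one session
>> (retries of "getChildren()" after the reconnect are elided):
>>
>>     import java.util.List;
>>     import java.util.UUID;
>>     import org.apache.zookeeper.*;
>>
>>     public class CreateWithRecovery {
>>         public static String createLockNode(ZooKeeper zk, String dir)
>>                 throws Exception {
>>             // A per-request token identifies this create() attempt
>>             // even if the response is lost and the session is shared.
>>             String token = UUID.randomUUID().toString();
>>             String prefix = dir + "/lock-" + token + "-";
>>             while (true) {
>>                 try {
>>                     return zk.create(prefix, new byte[0],
>>                             ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                             CreateMode.EPHEMERAL_SEQUENTIAL);
>>                 } catch (KeeperException.ConnectionLossException e) {
>>                     // The create() may or may not have succeeded on
>>                     // the server. After reconnecting, look for our
>>                     // token among the children.
>>                     List<String> kids = zk.getChildren(dir, false);
>>                     for (String k : kids) {
>>                         if (k.contains(token)) {
>>                             return dir + "/" + k;
>>                         }
>>                     }
>>                     // Token not found: the create() never happened;
>>                     // retry.
>>                 }
>>             }
>>         }
>>     }
>>
>> (A second connection loss during the lookup still needs the same retry
>> treatment, which is exactly the try/catch bulk I was worried about.)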
>>
>> Also, as a side note, that code appears to sort the child nodes by
>> session ID first and only then by sequence number, which could cause
>> locks to be granted out of order; a sketch of the ordering I would
>> expect follows.
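>>
>> A minimal sketch of the comparison I would expect, assuming node names
>> that end in "-<sequence>" (the parsing here is hypothetical):
>>
>>     import java.util.Comparator;
>>
>>     class SequenceOrder {
>>         // Order lock nodes by trailing sequence number only, ignoring
>>         // any session-ID component earlier in the name.
>>         static final Comparator<String> BY_SEQUENCE =
>>                 new Comparator<String>() {
>>             public int compare(String a, String b) {
>>                 long sa = Long.parseLong(
>>                         a.substring(a.lastIndexOf('-') + 1));
>>                 long sb = Long.parseLong(
>>                         b.substring(b.lastIndexOf('-') + 1));
>>                 return sa < sb ? -1 : (sa == sb ? 0 : 1);
>>             }
>>         };
>>     }
>>
>> Thanks for any help you can provide!
>>
>> Charles Gordon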