Date: Tue, 20 Jan 2015 11:42:21 -0500
From: Jordan Zimmerman
To: John Vines, user@curator.apache.org
Reply-To: user@curator.apache.org
Subject: Re: InterProcessMutex doesn't detect deletion of lock file
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="54be856d_70a64e2a_2608"

--54be856d_70a64e2a_2608
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

But manually deleting the lock node is not normal behavior. It should never happen in production. Can you explain the scenario in more detail?

-JZ

On January 20, 2015 at 10:47:20 AM, John Vines (vines@apache.org) wrote:

Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171

On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson wrote:

Hi,

I am fairly new to Curator and ZK, so apologies if this has been asked before. I haven't found anything yet that addresses it.

My ZK use case is very simple - HA failover. Two processes get launched - one does the work and the other waits to take over in case the other dies or otherwise stops working.

The Curator InterProcessMutex fits the bill. However, without too much effort I've found a scenario where Process A and Process B both think they are the owner at the same time and start doing the work, causing data corruption.

The scenario is simply to delete the lock file, which I did via the ZK CLI (zkCli.sh). The problem is that the InterProcessMutex currently holding the lock doesn't seem to notice that the lock file got deleted, but the InterProcessMutex in the waiting (failover) process *does* notice and creates a new lock and starts doing work.

Does the InterProcessMutex set a watch on the lock file it creates? If not, why not?

Idea #1:
I tried setting all the Listeners I could figure out how to set to detect the NodeDeleted event:
- CuratorListener
- ConnectionStateListener
- UnhandledErrorListener
but none get signaled when I manually delete the lock file.

Idea #2:
Is the solution to set my own watch on the lock file that the IPMutex created? If so, I see that one way to get the file name of the lock is to call InterProcessMutex#getParticipantNodes(). But the problem is that there can be more than one lock file, it seems:

    [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
    [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003, _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]

    [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
    [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
     _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
     _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
     _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]

And it seems that I can't use the one with the smallest sequential lock number, because the smallest one might be hanging around from a crashed lockholder and it hasn't expired yet - that is the case in the above example: lock-0000000012 is just waiting to be expired after a crash.

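To make the ordering problem concrete, here is a minimal sketch in plain Java (no Curator dependency) that sorts the node names from the second listing above by their sequence suffix. The class and method names (`LockNodeOrder`, `sequenceOf`, `sortBySequence`) are hypothetical and exist only for illustration; the sketch assumes node names end in `-<digits>`, as in the listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Curator.
final class LockNodeOrder {

    // Parses the numeric suffix of a sequential node name,
    // e.g. "..-lock-0000000012" -> 12.
    static int sequenceOf(String nodeName) {
        int idx = nodeName.lastIndexOf('-');
        return Integer.parseInt(nodeName.substring(idx + 1));
    }

    // Returns a copy of the node names ordered by ascending sequence number.
    static List<String> sortBySequence(List<String> nodes) {
        List<String> sorted = new ArrayList<>(nodes);
        sorted.sort(Comparator.comparingInt(LockNodeOrder::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        // Node names copied from the zkCli listing above.
        List<String> children = Arrays.asList(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");

        // The lowest sequence number is the node left over from the crash,
        // so "smallest" alone does not identify the live lock holder.
        System.out.println(sortBySequence(children).get(0));
    }
}
```

The sort only shows which node ZooKeeper would treat as first in line; it says nothing about which process created it, which is exactly the gap described above.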
So I don't know how to tell which lock is "mine" to set a watch on using that method.

Idea #3:
I see that the InterProcessMutex also takes an optional `LockInternalsDriver` argument. I looked into that code and there I see that it has access to the lock file name. In addition, in the `getsTheLock` method it creates a PredicateResults object with a `pathToWatch` arg, which sounds promising, but in the default impl with my setup that pathToWatch is null.

So I then created my own CustomLockInternalsDriver and put the lock-file name in pathToWatch (not sure that would work), but when I set `pathToWatch` to the actual lock path, still nothing happens when I delete the file.

So then I recorded the path to my lock in the CustomLockInternalsDriver so I could get it in my mainline code and set a WATCH manually/myself. That ends up working. But that's a lot of work and it's not at all clear what the right solution is and whether it is dangerous to fiddle with creating my own LockInternalsDriver impl.

What is the right way to solve this issue?

--- How to REPRODUCE ---

Here's a link to a gist with my test code: https://gist.github.com/quux00/f6be8fe223a7832ef514

Also a gist to my CustomLockInternalsDriver: https://gist.github.com/quux00/ab37cedc46cb5368c853

Start up two instances of that code. One will indicate it is "working" and the other "waiting".

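For the "working" process in this reproduction, the manual watch from Idea #3 boils down to a flag that the work loop checks before each unit of work. Below is a minimal, library-free sketch of that idea. `LockLossGuard` and its methods are hypothetical names, and the wiring to ZooKeeper is not shown: it assumes some watch callback (for example one registered through Curator's `checkExists().usingWatcher(...)`) forwards the event type name and path. This is not the code from the gists.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not part of Curator.
final class LockLossGuard {
    private final String lockPath;
    private final AtomicBoolean owned = new AtomicBoolean(true);

    // lockPath is the full path of the lock node this process created,
    // e.g. as recorded by a custom LockInternalsDriver.
    LockLossGuard(String lockPath) {
        this.lockPath = lockPath;
    }

    // Intended to be called from the watch callback. eventType is assumed
    // to be the ZooKeeper event type name, such as "NodeDeleted".
    void onEvent(String eventType, String path) {
        if ("NodeDeleted".equals(eventType) && lockPath.equals(path)) {
            owned.set(false);
        }
    }

    // The work loop checks this before each unit of work.
    boolean stillOwner() {
        return owned.get();
    }

    public static void main(String[] args) {
        String mine = "/XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006";
        LockLossGuard guard = new LockLossGuard(mine);

        // Simulates the manual delete done with zkCli.sh.
        guard.onEvent("NodeDeleted", mine);

        if (!guard.stillOwner()) {
            System.out.println("lock node deleted, stopping work");
        }
    }
}
```

Keep in mind that ZooKeeper watches are one-shot and notifications arrive asynchronously, so a guard like this narrows the window in which both processes work but does not close it.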
I then use zkCli.sh to delete the file:

    $ ./zkCli.sh
    [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
    [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
    [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
    [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
    []

The "waiting" process will now create a new lock file and now both processes are "working".

Thank you,
Michael

--54be856d_70a64e2a_2608
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline