curator-user mailing list archives

From Mike Drob <mad...@cloudera.com>
Subject Re: InterProcessMutex doesn't detect deletion of lock file
Date Tue, 20 Jan 2015 17:55:19 GMT
I'm with Jordan on this. I would not expect Curator to continue to work if
somebody did "rm -rf /" on their server. I also do not think this is a
situation that we should have to account for, despite personally having
seen it more than once. I'll admit that I'm exaggerating the argument a
little bit, but at some point we have to trust that the underlying
infrastructure, like the file system, will be there and that operators won't
break it.

On Tue, Jan 20, 2015 at 11:48 AM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> In the many years of Curator’s existence, no one that I know of has had an
> issue with this. ZooKeeper is very robust and nodes do not get deleted
> abnormally like this. You are posing a hypothetical situation. It’s not
> reasonable to handle every single edge case. This would be the equivalent
> of someone going into the production database and arbitrarily deleting
> records. The locking code is already incredibly complicated and I wouldn’t
> want to burden it with this new behavior and overhead. However, if you can
> make it work reasonably please provide a PR and the committers will look at
> it.
>
> -Jordan
>
>
>
> On January 20, 2015 at 12:38:36 PM, Michael Peterson (quux00@gmail.com)
> wrote:
>
> > But manually deleting the lock node is not normal behavior.
> > It should never happen in production.
>
> I agree that it would be abnormal.  But abnormal doesn't mean impossible.
>
> > Can you explain the scenario in more detail?
>
> There may be a bug in ZK (now or in the future) that in some rare cases
> deletes a file when it should not.
>
> Or a team might be in the practice of managing their ZK ensemble via the ZK
> CLI, and someone might accidentally type:
> "delete /XXX/masterlock
> /_c_c6101d8e-5af2-4290-8bc6-4005048c9a77-lock-0000000000"
>
> rather than
>
> "get /XXX/masterlock
> /_c_c6101d8e-5af2-4290-8bc6-4005048c9a77-lock-0000000000".
>
> Or even worse, type
> "rmr /XXX/masterlock".
>
> (I've seen a somewhat similar manual mistake made on the HDFS of a production
> Hadoop system, where months of data were deleted by using up-arrow too fast
> and issuing a -rmr instead of an -ls command.)
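>
> It is easy to see that the mutex never notices such a deletion. A minimal
> sketch (Curator 2.x recipes API; the connect string, class name, and lock
> path below are illustrative only, not from a real deployment):
>
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.curator.framework.CuratorFrameworkFactory;
>     import org.apache.curator.framework.recipes.locks.InterProcessMutex;
>     import org.apache.curator.retry.ExponentialBackoffRetry;
>
>     public class LostLockDemo {
>         public static void main(String[] args) throws Exception {
>             CuratorFramework client = CuratorFrameworkFactory.newClient(
>                     "localhost:2181", new ExponentialBackoffRetry(1000, 3));
>             client.start();
>
>             InterProcessMutex mutex = new InterProcessMutex(client, "/XXX/masterlock");
>             mutex.acquire();
>
>             // While this sleeps, delete the lock's child node from the ZK CLI
>             // (e.g. the accidental "delete"/"rmr" above). InterProcessMutex
>             // keeps no watch on its own node, so nothing fires.
>             Thread.sleep(30000);
>
>             // Still reports true, even though another process could now
>             // acquire the "same" lock -- mutual exclusion is silently gone.
>             System.out.println("still held? " + mutex.isAcquiredInThisProcess());
>
>             mutex.release();   // releasing an already-deleted node; exact
>                                // behavior varies by Curator version
>             client.close();
>         }
>     }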
>
> For a system where I need to be absolutely sure that I and only I have the
> lock, this abnormal "backdoor" deletion possibility worries me.  To build a
> truly robust system, you have to handle all the possibilities you can.
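>
> One defensive pattern would be to set a watch on your own lock node right
> after acquiring it, so an external delete is at least detected. This is a
> sketch only: the "ourLockNode" parameter is hypothetical, since
> InterProcessMutex does not publicly expose the znode it created, and
> obtaining that path is left as an assumption here.
>
>     import java.util.concurrent.atomic.AtomicBoolean;
>     import org.apache.curator.framework.CuratorFramework;
>     import org.apache.zookeeper.WatchedEvent;
>     import org.apache.zookeeper.Watcher;
>
>     class LockGuard {
>         static AtomicBoolean watchForDeletion(CuratorFramework client,
>                                               String ourLockNode) throws Exception {
>             final AtomicBoolean lockLost = new AtomicBoolean(false);
>             Watcher watcher = new Watcher() {
>                 @Override
>                 public void process(WatchedEvent event) {
>                     if (event.getType() == Event.EventType.NodeDeleted) {
>                         lockLost.set(true);   // caller must stop lock-protected work
>                     }
>                 }
>             };
>             // One-shot watch: if it fires for any other event type it must
>             // be re-registered to keep guarding the node.
>             client.checkExists().usingWatcher(watcher).forPath(ourLockNode);
>             return lockLost;
>         }
>     }
>
> The caller would then check lockLost.get() before each piece of
> lock-protected work; that narrows the window but cannot close it entirely.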
>
> The https://issues.apache.org/jira/browse/CURATOR-171 issue referenced
> earlier seems to be arguing the same thing.
>
>
> On Tue, Jan 20, 2015 at 11:42 AM, Jordan Zimmerman <
> jordan@jordanzimmerman.com> wrote:
>
>>  But manually deleting the lock node is not normal behavior. It should
>> never happen in production. Can you explain the scenario in more detail?
>>
>>  -JZ
>>
>>
