zookeeper-dev mailing list archives

From "Vishal K (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ZOOKEEPER-1046) Creating a new sequential node results in a ZNODEEXISTS error
Date Fri, 06 May 2011 16:09:03 GMT

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vishal K updated ZOOKEEPER-1046:
--------------------------------

    Attachment: ZOOKEEPER-1046.patch

Hi Camille, Jeremy,

I have attached a patch that I think should fix the problem that Camille
pointed out.

There are subtle effects of this patch that need to be reviewed
carefully:

1. Previously, DataTree.processTxn() did not return an error if a
create/delete failed with a NODEEXISTS/NONODE exception (the if
condition was also a bit strange). I think we should not do this. Per
my understanding, such a failure can occur only while we are replaying
transactions; it should not occur while the server is processing live
transactions (unless there is a bug). So processTxn() should return
the error to the caller even in this case, and the caller should
decide whether or not to ignore it. In FileTxnSnapLog we handle the
error by incrementing the cversion of the parent. In
FinalRequestProcessor we do nothing special, but we do not ignore the
error either (see the first sketch after this list).

2. We could increment the cversion in deleteNode() itself (as I
mentioned in my previous comment), but I think the changes in the
attached patch are a better approach. I would like to confirm that not
incrementing the cversion in deleteNode() will not introduce any
synchronization issue, since the lock on the parent will be released
and then reacquired (see the second sketch below). I don't think there
is a problem here, but it would be good if you could confirm this.
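
Here is the first sketch: a minimal, self-contained model of the
division of responsibility in point 1. The names are hypothetical;
this is not the actual DataTree/FileTxnSnapLog code. The create path
reports NODEEXISTS to its caller, and only the replay-time caller
compensates by bumping the parent's cversion:

{noformat}
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the patch's approach: processTxn()-style code
// returns the error, and the caller decides how to handle it.
public class ReplaySketch {
    enum Code { OK, NODEEXISTS, NONODE }

    static class Node {
        int cversion;  // child-list version, as in a znode's stat
        final Map<String, Node> children = new HashMap<String, Node>();
    }

    // Live create path: reports the failure instead of swallowing it.
    static Code create(Node parent, String name) {
        if (parent.children.containsKey(name)) {
            return Code.NODEEXISTS;  // caller decides whether this is fatal
        }
        parent.children.put(name, new Node());
        parent.cversion++;           // normal path advances cversion
        return Code.OK;
    }

    // Replay-time caller (FileTxnSnapLog's role): a create that hits
    // NODEEXISTS was already applied before the snapshot was taken, but
    // the snapshot's cversion may lag behind, so advance it by hand.
    // A live-request caller (FinalRequestProcessor's role) would instead
    // propagate the error to the client without any compensation.
    static void replayCreate(Node parent, String name) {
        if (create(parent, name) == Code.NODEEXISTS) {
            parent.cversion++;       // keep cversion ahead of existing children
        }
    }

    public static void main(String[] args) {
        Node parent = new Node();
        create(parent, "child");              // applied once before the "snapshot"
        replayCreate(parent, "child");        // replayed from the txn log
        System.out.println(parent.cversion);  // 2: replay still advanced cversion
    }
}
{noformat}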
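And the second sketch, for the synchronization question in point 2
(again hypothetical; a method that could sit alongside the Node class
above). If the cversion bump does not happen inside the same
synchronized block that removes the child, the parent lock is released
and reacquired, and another writer can slip in between. Each increment
still happens exactly once under the lock, though, so interleavings
reorder the bumps without changing the final cversion, which is why I
don't expect a problem:

{noformat}
// Hypothetical illustration of the point-2 concern: the delete and the
// cversion bump run under two separate acquisitions of the parent lock.
static void deleteThenBump(Node parent, String name) {
    synchronized (parent) {
        parent.children.remove(name);  // parent lock released here
    }
    // <-- another thread may create or delete a sibling at this point
    synchronized (parent) {
        parent.cversion++;             // lock reacquired; bump lands later
    }
}
{noformat}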

I noticed that the same bug can occur while creating a node as
well. The patch handles both create and delete.

I was able to test this patch using the snapshots and logs that Jeremy
had attached. I had to revert the logs to the state where this bug
would first appear, which I did with: rm *.1d8* *.1d9*

Then I could verify that "create -s /zkrsm/000000000000002d_record test"
succeeds with this patch (and fails without it).

I have not thought about writing a test for this yet (sounds complex to me).

Thanks.
-Vishal


> Creating a new sequential node results in a ZNODEEXISTS error
> -------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1046
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1046
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.2, 3.3.3
>         Environment: A 3 node-cluster running Debian squeeze.
>            Reporter: Jeremy Stribling
>              Labels: sequence
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1046.patch, ZOOKEEPER-1046.patch, ZOOKEEPER-1046.tgz
>
>
> On several occasions, I've seen a create() with the sequential flag set fail with a
> ZNODEEXISTS error, and I don't think that should ever be possible.  In past runs, I've
> been able to closely inspect the state of the system with the command line client, and
> saw that the parent znode's cversion is smaller than the sequence numbers of existing
> child znodes under that parent.  In one example:
> {noformat}
> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
> cZxid = 0x5
> ctime = Mon Jan 17 18:28:19 PST 2011
> mZxid = 0x5
> mtime = Mon Jan 17 18:28:19 PST 2011
> pZxid = 0x1d819
> cversion = 120710
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 2955
> {noformat}
> However, the znode /zkrsm/000000000000002d_record0000120804 existed on disk.
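> A minimal sketch of the mechanism (hypothetical names, not ZooKeeper's actual code): the
> sequential suffix is derived from the parent's cversion, so once the recorded cversion
> falls behind the existing children, create() keeps generating names that are already taken:
> {noformat}
> // Hypothetical sketch: a sequential create builds the child's name from
> // the parent's cversion as a zero-padded 10-digit suffix, so a lagging
> // cversion regenerates names that already exist.
> public class SequenceSketch {
>     static String sequentialName(String prefix, int parentCversion) {
>         return prefix + String.format("%010d", parentCversion);
>     }
>
>     public static void main(String[] args) {
>         // cversion from the stat above vs. the child that exists on disk:
>         System.out.println(sequentialName("000000000000002d_record", 120710));
>         // prints 000000000000002d_record0000120710, but children up to
>         // 000000000000002d_record0000120804 already exist -> NodeExists
>     }
> }
> {noformat}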
> In a recent run, I was able to capture the ZooKeeper logs, and I will attach them to
> this JIRA.  The logs are named nodeX.<zxid_prefixes>.log, and each new log represents
> an application process restart.
> Here's the scenario:
> # There's a cluster with nodes 1,2,3 using zxid 0x3.
> # All three nodes restart, forming a cluster of zxid 0x4.
> # Node 3 restarts, leading to a cluster of 0x5.
> At this point, it seems like node 1 is the leader of the 0x5 epoch.  In its log
> (node1.0x4-0x5.log) you can see the first (of many) instances of the following message:
> {noformat}
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO org.apache.zookeeper.server.PrepRequestProcessor - Got user-level KeeperException when processing sessionid:0x512f466bd44e0002 type:create cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error Path:/zkrsm/00000000000000b2_record0001761440 Error:KeeperErrorCode = NodeExists for /zkrsm/00000000000000b2_record0001761440
> {noformat}
> This then repeats forever as my application isn't expecting to ever get this error
> message on a sequential node create, and just continually retries.  The message even
> transfers over to node3.0x5-0x6.log once the 0x6 epoch comes into play.
> I don't see anything terribly fishy in the transition between the epochs; the correct
> snapshots seem to be getting transferred, etc.  Unfortunately I don't have a ZK
> snapshot/log that exhibits the problem when starting with a fresh system.
> Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due to a bug in
> our application code.  (They are assigned randomly, but are supposed to be consistent
> across restarts.)
> * We manage node membership dynamically, and our application restarts the
> ZooKeeperServer classes whenever a new node wants to join (without restarting the
> entire application process).  This is why you'll see messages like the following in
> node1.0x4-0x5.log before a new election begins:
> {noformat}
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO org.apache.zookeeper.server.quorum.Learner - shutdown called
> {noformat}
> * There is in fact one of these dynamic membership changes in node1.0x4-0x5.log, just
> before the 0x4 epoch is formed.  I'm not sure how this would be related, though, as no
> transactions are done during this period.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
