zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jeremy Stribling <st...@nicira.com>
Subject Re: znode metadata consistency
Date Tue, 12 Apr 2011 19:33:22 GMT
Unfortunately I'm seeing this issue again in 3.3.3.  A number of node 
restarts resulted in a ZNODEEXISTS error when creating a new sequential 
znode.  I've got a 3-node setup running on Debian, and I've collected 
the full logs here: http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz 
[24 MB, 670 MB unzipped].  The logs are named as 
nodeX.<zxid_prefixes>.log, and each new log represents an application 
process restart.

Here's the scenario:

1) There's a cluster with nodes 1,2,3 using zxid 0x3.
2) All three nodes restart, forming a cluster of zxid 0x4.
3) Node 3 restarts, leading to a cluster of 0x5.

At this point, it seems like node 1 is the leader of the 0x5 epoch.  In 
its log (node1.0x4-0x5.log) you can see the first (of many) instances of 
the following message:

2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO 
org.apache.zookeeper.server.PrepRequestProcessor  - Got user-level 
KeeperException when processing sessionid:0x512f466bd44e0002 type:create 
cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a 
Error Path:/zkrsm/00000000000000b2_record0001761440 
Error:KeeperErrorCode = NodeExists for 

This then repeats forever as my application isn't expecting to ever get 
this error message on a sequential node create, and just continually 
retries.  The message even transfers over to node3.0x5-0x6.log once the 
0x6 epoch comes into play.

I don't see anything terribly fishy in the transition between the 
epochs; the correct snapshots seem to be getting transferred, etc.  
Unfortunately I don't have a ZK snapshot/log that exhibits the problem 
when starting with a fresh system.

Does anyone have any ideas?  Maybe it's related to one of the 
outstanding election JIRAs (like ZOOKEEPER-975)?  Thanks,


P.S. Some oddities you might notice in these logs:
* Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due 
to a bug in our application code.
* We manage node membership dynamically, and our application restarts 
the ZooKeeperServer classes whenever a new node wants to join (without 
restarting the entire application process).  This is why you'll see 
messages like the following in node1.0x4-0x5.log before a new election 

2011-04-11 21:16:00,762 4804 [QuorumPeer:/] INFO 
org.apache.zookeeper.server.quorum.Learner  - shutdown called

On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
> My very shaky understanding from skimming those issues was that in 
> some cases there were two threads handling different types of data 
> that are related to the same transaction -- but maybe that's only true 
> when there's a leader and a follower.  But I also saw something in 
> there about restoring data from a snapshot vs. restoring it from a 
> log, which seems like it could have happened in a single node case.
> In any case, now the 3.3.3 is out I'll give it a try and report back 
> if we keep seeing this.
> Thanks!
> Jeremy
> On 03/01/2011 02:59 PM, Vishal Kher wrote:
>> Hi Jeremy,
>> I just realized that you are using a standalone ZK server. I don't 
>> think the bugs apply to you, so I don't have an answer to your question.
>> I think 3.3.3 should be released soon: 
>> http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
>> -Vishal
>> On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling <strib@nicira.com 
>> <mailto:strib@nicira.com>> wrote:
>>     Thanks for the pointers Vishal, I hadn't seen those.  They look
>>     like they could be related, but without knowing how metadata
>>     updates are grouped into transactions, it's hard for me to say.
>>      I would expect the cversion update to happen within the same
>>     transaction as the creation of a new child, but if they get
>>     written to the log in two separate steps, perhaps these issues
>>     could explain it.
>>     Any estimate on when 3.3.3 will be released?  I haven't seen any
>>     updates on the user list about it.  Thanks,
>>     Jeremy
>>     On 03/01/2011 12:40 PM, Vishal Kher wrote:
>>         Hi Jermy,
>>         One of the main reasons for 3.3.3 release was to include
>>         fixes for znode
>>         inconsistency bugs.
>>         Have you taken a look at
>>         https://issues.apache.org/jira/browse/ZOOKEEPER-962and
>>         https://issues.apache.org/jira/browse/ZOOKEEPER-919?
>>         The problem that you are seeing sounds similar to the ones
>>         reported.
>>         -Vishal
>>         On Mon, Feb 28, 2011 at 8:04 PM, Jeremy
>>         Stribling<strib@nicira.com <mailto:strib@nicira.com>>  wrote:
>>             Hi all,
>>             A while back I noticed that my Zookeeper cluster got into
>>             a state where I
>>             would get a "node exists" error back when creating a
>>             sequential znode -- see
>>             the thread starting at
>>             http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%3C4CCA1E2F.9020606@nicira.com%3Efor
>>             more details.  The summary is that at the time, my
>>             application had a bug
>>             that could have been improperly bringing new nodes into a
>>             cluster.
>>             However, I've seen this a couple more times since fixing
>>             that original bug.
>>              I don't yet know how to reproduce it, but I am going to
>>             keep trying.  In
>>             one case, we restarted a node (in a one-node cluster),
>>             and when it came back
>>             up we could no longer create sequential nodes on a
>>             certain parent node, with
>>             a node exists (-110) error code.  The biggest child it
>>             saw on restart was
>>             /zkrsm/000000000000002d_record0000120804 (i.e., a
>>             sequence number of
>>             120804), however a stat on the parent node revealed that
>>             the cversion was
>>             only 120710:
>>             [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
>>             cZxid = 0x5
>>             ctime = Mon Jan 17 18:28:19 PST 2011
>>             mZxid = 0x5
>>             mtime = Mon Jan 17 18:28:19 PST 2011
>>             pZxid = 0x1d819
>>             cversion = 120710
>>             dataVersion = 0
>>             aclVersion = 0
>>             ephemeralOwner = 0x0
>>             dataLength = 0
>>             numChildren = 2955
>>             So my question is: how is znode metadata persisted with
>>             respect to the
>>             actual znodes?  Is it possible that a node's children
>>             will get synced to
>>             disk before its own metadata, and if it crashes at a bad
>>             time, the metadata
>>             updates will be lost?  If so, is there any way to
>>             constrain Zookeeper so
>>             that it will sync its metadata before returning success
>>             for write
>>             operations?
>>             (I'm using Zookeeper 3.3.2 on a Debian Squeeze 64-bit
>>             box, with
>>             openjdk-6-jre 6b18-1.8.3-2.)
>>             I'd be happy to create a JIRA for this if that seems
>>             useful, but without a
>>             way to reproduce it I'm not sure that it is.
>>             Thanks,
>>             Jeremy

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message