zookeeper-user mailing list archives

From Vishal Kher <vishalm...@gmail.com>
Subject Re: znode metadata consistency
Date Tue, 12 Apr 2011 21:07:40 GMT
Hi Jeremy,

It would be good to open a jira for this. In addition to this description,
please explain the steps to reproduce this problem.
Did this problem happen after changing membership? Can you also add to the
jira how you perform the membership change?

-Vishal

On Tue, Apr 12, 2011 at 3:33 PM, Jeremy Stribling <strib@nicira.com> wrote:

>  Unfortunately I'm seeing this issue again in 3.3.3.  A number of node
> restarts resulted in a ZNODEEXISTS error when creating a new sequential
> znode.  I've got a 3-node setup running on Debian, and I've collected the
> full logs here: http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz [24
> MB, 670 MB unzipped].  The logs are named as nodeX.<zxid_prefixes>.log, and
> each new log represents an application process restart.
>
> Here's the scenario:
>
> 1) There's a cluster with nodes 1,2,3 running with epoch 0x3.
> 2) All three nodes restart, forming a cluster with epoch 0x4.
> 3) Node 3 restarts, leading to a cluster with epoch 0x5.
>
> At this point, it seems like node 1 is the leader of the 0x5 epoch.  In its
> log (node1.0x4-0x5.log) you can see the first (of many) instances of the
> following message:
>
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
> org.apache.zookeeper.server.PrepRequestProcessor  - Got user-level
> KeeperException when processing sessionid:0x512f466bd44e0002 type:create
> cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/zkrsm/00000000000000b2_record0001761440 Error:KeeperErrorCode =
> NodeExists for /zkrsm/00000000000000b2_record0001761440
>
> This then repeats forever, because my application doesn't expect to ever
> get this error on a sequential node create and so just continually
> retries.  The message even carries over into node3.0x5-0x6.log once the
> 0x6 epoch comes into play.
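>
> A simplified sketch of the retry loop (not our exact code; the path
> prefix and error handling are condensed):
>
>     import org.apache.zookeeper.CreateMode;
>     import org.apache.zookeeper.KeeperException;
>     import org.apache.zookeeper.ZooDefs;
>     import org.apache.zookeeper.ZooKeeper;
>
>     // Create a sequential record under /zkrsm, retrying on errors.  We
>     // assumed NodeExists was impossible for a sequential create, so once
>     // the bug is hit this loops forever.
>     String createRecord(ZooKeeper zk, String prefix, byte[] data)
>             throws InterruptedException {
>         while (true) {
>             try {
>                 return zk.create("/zkrsm/" + prefix + "_record", data,
>                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
>                                  CreateMode.PERSISTENT_SEQUENTIAL);
>             } catch (KeeperException.NodeExistsException e) {
>                 // "Impossible" for sequential creates -- retry anyway.
>             } catch (KeeperException e) {
>                 // Connection loss etc.: also retry.
>             }
>         }
>     }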
>
> I don't see anything terribly fishy in the transition between the epochs;
> the correct snapshots seem to be getting transferred, etc.  Unfortunately I
> don't have a ZK snapshot/log that exhibits the problem when starting with a
> fresh system.
>
> Does anyone have any ideas?  Maybe it's related to one of the outstanding
> election JIRAs (like ZOOKEEPER-975)?  Thanks,
>
> Jeremy
>
> P.S. Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due to
> a bug in our application code.
> * We manage node membership dynamically, and our application restarts the
> ZooKeeperServer classes whenever a new node wants to join (without
> restarting the entire application process).  This is why you'll see messages
> like the following in node1.0x4-0x5.log before a new election begins:
>
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
> org.apache.zookeeper.server.quorum.Learner  - shutdown called
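>
> The restart sequence itself is roughly the following (a sketch: the
> configuration plumbing around QuorumPeer is elided, and makeQuorumPeer
> is a hypothetical stand-in for our setup code):
>
>     import org.apache.zookeeper.server.quorum.QuorumPeer;
>
>     // On a membership change, shut down the current quorum participant
>     // (which produces the "shutdown called" line above) and start a new
>     // one with the updated server list, all in the same process.
>     quorumPeer.shutdown();
>     quorumPeer = makeQuorumPeer(newServerList);  // hypothetical helper
>     quorumPeer.start();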
>
>
>
> On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
>
> My very shaky understanding from skimming those issues was that in some
> cases there were two threads handling different types of data related to
> the same transaction -- but maybe that's only true when there's a leader
> and a follower.  But I also saw something in there about restoring data
> from a snapshot vs. restoring it from a log, which seems like it could
> have happened in the single-node case.
>
> In any case, now that 3.3.3 is out I'll give it a try and report back if
> we keep seeing this.
>
> Thanks!
>
> Jeremy
>
> On 03/01/2011 02:59 PM, Vishal Kher wrote:
>
> Hi Jeremy,
>
> I just realized that you are using a standalone ZK server. I don't think
> the bugs apply to you, so I don't have an answer to your question.
> I think 3.3.3 should be released soon:
> http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
>
> -Vishal
>
> On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling <strib@nicira.com> wrote:
>
>> Thanks for the pointers, Vishal; I hadn't seen those.  They look like they
>> could be related, but without knowing how metadata updates are grouped into
>> transactions, it's hard for me to say.  I would expect the cversion update
>> to happen within the same transaction as the creation of a new child, but if
>> they get written to the log in two separate steps, perhaps these issues
>> could explain it.
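>>
>> (For what it's worth, my rough reading of the server code is that a
>> sequential child's name is derived from the parent's cversion at create
>> time, something like:
>>
>>     // Paraphrased from the server's PrepRequestProcessor: the new
>>     // child's name is the requested path plus the parent's current
>>     // cversion, zero-padded to 10 digits.
>>     path = path + String.format("%010d", parentRecord.stat.getCversion());
>>
>> so if the parent's cversion on disk ever lags the children that actually
>> exist, new sequential creates would collide with existing names.)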
>>
>> Any estimate on when 3.3.3 will be released?  I haven't seen any updates
>> on the user list about it.  Thanks,
>>
>> Jeremy
>>
>>
>> On 03/01/2011 12:40 PM, Vishal Kher wrote:
>>
>>> Hi Jeremy,
>>>
>>> One of the main reasons for the 3.3.3 release was to include fixes for
>>> znode inconsistency bugs.
>>> Have you taken a look at
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
>>> https://issues.apache.org/jira/browse/ZOOKEEPER-919?
>>> The problem that you are seeing sounds similar to the ones reported.
>>>
>>> -Vishal
>>>
>>>
>>>
>>> On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling<strib@nicira.com>
>>>  wrote:
>>>
>>>
>>>
>>>> Hi all,
>>>>
>>>> A while back I noticed that my Zookeeper cluster got into a state where
>>>> I
>>>> would get a "node exists" error back when creating a sequential znode --
>>>> see
>>>> the thread starting at
>>>>
>>>> http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%3C4CCA1E2F.9020606@nicira.com%3E
>>>> for more details.  The summary is that at the time, my application had
>>>> a bug that could have been improperly bringing new nodes into a cluster.
>>>>
>>>> However, I've seen this a couple more times since fixing that original
>>>> bug.  I don't yet know how to reproduce it, but I am going to keep
>>>> trying.  In one case, we restarted a node (in a one-node cluster), and
>>>> when it came back up we could no longer create sequential nodes on a
>>>> certain parent node, failing with a node exists (-110) error code.  The
>>>> biggest child it saw on restart was
>>>> /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of
>>>> 120804); however, a stat on the parent node revealed that the cversion
>>>> was only 120710:
>>>>
>>>> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
>>>> cZxid = 0x5
>>>> ctime = Mon Jan 17 18:28:19 PST 2011
>>>> mZxid = 0x5
>>>> mtime = Mon Jan 17 18:28:19 PST 2011
>>>> pZxid = 0x1d819
>>>> cversion = 120710
>>>> dataVersion = 0
>>>> aclVersion = 0
>>>> ephemeralOwner = 0x0
>>>> dataLength = 0
>>>> numChildren = 2955
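>>>>
>>>> In case it helps anyone check for the same inconsistency, here is a
>>>> sketch of the comparison (the connect string is made up; in a
>>>> consistent state the largest sequence suffix should always be below
>>>> the parent's cversion, since sequence numbers are assigned from it):
>>>>
>>>>     import java.util.List;
>>>>     import org.apache.zookeeper.ZooKeeper;
>>>>     import org.apache.zookeeper.data.Stat;
>>>>
>>>>     public class CheckCversion {
>>>>         public static void main(String[] args) throws Exception {
>>>>             ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, null);
>>>>             Stat stat = new Stat();
>>>>             List<String> kids = zk.getChildren("/zkrsm", false, stat);
>>>>             int maxSeq = -1;
>>>>             for (String kid : kids) {
>>>>                 // Children look like 000000000000002d_record0000120804;
>>>>                 // the last 10 digits are the sequence number.
>>>>                 maxSeq = Math.max(maxSeq,
>>>>                     Integer.parseInt(kid.substring(kid.length() - 10)));
>>>>             }
>>>>             // Broken state observed above: maxSeq 120804 > cversion 120710.
>>>>             System.out.println("cversion=" + stat.getCversion()
>>>>                                + " maxSeq=" + maxSeq);
>>>>         }
>>>>     }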
>>>>
>>>> So my question is: how is znode metadata persisted with respect to the
>>>> actual znodes?  Is it possible that a node's children will get synced to
>>>> disk before its own metadata, and if it crashes at a bad time, the
>>>> metadata updates will be lost?  If so, is there any way to constrain
>>>> Zookeeper so that it will sync its metadata before returning success
>>>> for write operations?
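>>>>
>>>> (The only possibly related knob I've found is the zookeeper.forceSync
>>>> system property, which defaults to "yes" and controls whether the
>>>> transaction log is fsynced before the server acknowledges a write,
>>>> e.g. set via the JVM flags used to start the server:
>>>>
>>>>     export JVMFLAGS="-Dzookeeper.forceSync=yes"   # "yes" is the default
>>>>
>>>> but I assume that doesn't help if the parent's metadata update isn't
>>>> part of the logged transaction.)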
>>>>
>>>> (I'm using Zookeeper 3.3.2 on a Debian Squeeze 64-bit box, with
>>>> openjdk-6-jre 6b18-1.8.3-2.)
>>>>
>>>> I'd be happy to create a JIRA for this if that seems useful, but
>>>> without a way to reproduce it I'm not sure that it is.
>>>>
>>>> Thanks,
>>>>
>>>> Jeremy
>>>>
>>>
>>>
>>
>
