zookeeper-user mailing list archives

From: Jeremy Stribling <st...@nicira.com>
Subject: Re: znode metadata consistency
Date: Tue, 12 Apr 2011 22:27:21 GMT
Ok, thank you.  I've created:

https://issues.apache.org/jira/browse/ZOOKEEPER-1046

and left it unassigned for now.

Jeremy

On 04/12/2011 03:09 PM, Fournier, Camille F. [Tech] wrote:
> I have some ideas of where to look on this, and I'm happy to do it in the
> next day or two if no one else wants to look.  Please do open a jira even
> though you can't reproduce it consistently yet, just so we have somewhere
> to track the efforts.
>
> C
>
> -----Original Message-----
> From: Jeremy Stribling [mailto:strib@nicira.com]
> Sent: Tuesday, April 12, 2011 5:11 PM
> To: Vishal Kher
> Cc: user@zookeeper.apache.org
> Subject: Re: znode metadata consistency
>
> I haven't found a way to reproduce it consistently, but if I do I will
> update a JIRA.  Until then, I was mostly hoping that someone on-list
> could explain whether the metadata of a parent znode is written to disk
> in a consistent way at the same time that a new child node is added
> under the parent.  I would hope that those two operations get lumped
> together in the same transaction, but I haven't yet been able to dig
> into the code myself to confirm that.
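>
> To make concrete what I'm asking, here's a minimal sketch against the
> Java client (the connect string is a placeholder, it assumes /zkrsm
> already exists, and of course it only shows the in-memory view, not
> what actually makes it to disk):
>
>     import java.util.concurrent.CountDownLatch;
>     import org.apache.zookeeper.CreateMode;
>     import org.apache.zookeeper.WatchedEvent;
>     import org.apache.zookeeper.Watcher;
>     import org.apache.zookeeper.ZooDefs;
>     import org.apache.zookeeper.ZooKeeper;
>     import org.apache.zookeeper.data.Stat;
>
>     public class CversionCheck {
>         public static void main(String[] args) throws Exception {
>             final CountDownLatch connected = new CountDownLatch(1);
>             ZooKeeper zk = new ZooKeeper("127.0.0.1:2181", 30000,
>                     new Watcher() {
>                         public void process(WatchedEvent event) {
>                             if (event.getState() ==
>                                     Event.KeeperState.SyncConnected) {
>                                 connected.countDown();
>                             }
>                         }
>                     });
>             connected.await();
>
>             Stat before = zk.exists("/zkrsm", false);
>             // a sequential create under /zkrsm should bump the
>             // parent's cversion
>             String child = zk.create("/zkrsm/record", new byte[0],
>                     ZooDefs.Ids.OPEN_ACL_UNSAFE,
>                     CreateMode.PERSISTENT_SEQUENTIAL);
>             Stat after = zk.exists("/zkrsm", false);
>
>             System.out.println("created " + child + ", parent cversion "
>                     + before.getCversion() + " -> " + after.getCversion());
>             zk.close();
>         }
>     }
>
> What I can't tell is whether that cversion change and the new child are
> guaranteed to be persisted together, or whether they can hit the disk
> separately.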
>
> On 04/12/2011 02:07 PM, Vishal Kher wrote:
>    
>> Hi Jeremy,
>>
>> It would be good to open a jira for this.  In addition to this
>> description, please explain the steps to reproduce the problem.
>> Did this problem happen after changing membership?  Can you also add to
>> the jira how you perform membership changes?
>>
>> -Vishal
>>
>> On Tue, Apr 12, 2011 at 3:33 PM, Jeremy Stribling <strib@nicira.com> wrote:
>>
>>      Unfortunately I'm seeing this issue again in 3.3.3.  A number of
>>      node restarts resulted in a ZNODEEXISTS error when creating a new
>>      sequential znode.  I've got a 3-node setup running on Debian, and
>>      I've collected the full logs here:
>>      http://pdos.csail.mit.edu/~strib/zk_node_exists.tgz  [24 MB,
>>      670 MB unzipped].  The logs are named as
>>      nodeX.<zxid_prefixes>.log, and each new log represents an
>>      application process restart.
>>
>>      Here's the scenario:
>>
>>      1) There's a cluster with nodes 1, 2, and 3 on zxid epoch 0x3.
>>      2) All three nodes restart, forming a cluster on epoch 0x4.
>>      3) Node 3 restarts, leading to a cluster on epoch 0x5.
>>
>>      At this point, it seems like node 1 is the leader of the 0x5
>>      epoch.  In its log (node1.0x4-0x5.log) you can see the first (of
>>      many) instances of the following message:
>>
>>      2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
>>      org.apache.zookeeper.server.PrepRequestProcessor  - Got user-level
>>      KeeperException when processing sessionid:0x512f466bd44e0002
>>      type:create cxid:0x4da376ab zxid:0xfffffffffffffffe
>>      txntype:unknown reqpath:n/a Error
>>      Path:/zkrsm/00000000000000b2_record0001761440
>>      Error:KeeperErrorCode = NodeExists for
>>      /zkrsm/00000000000000b2_record0001761440
>>
>>      This then repeats forever, because my application doesn't expect
>>      to ever get this error on a sequential node create and just
>>      retries continually.  The message even carries over to
>>      node3.0x5-0x6.log once the 0x6 epoch comes into play.
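>>
>>      For context, the retry logic on our side is roughly the following
>>      (a simplified sketch, not our actual code):
>>
>>          import org.apache.zookeeper.CreateMode;
>>          import org.apache.zookeeper.KeeperException;
>>          import org.apache.zookeeper.ZooDefs;
>>          import org.apache.zookeeper.ZooKeeper;
>>
>>          class RecordCreator {
>>              // keeps retrying until the sequential create succeeds
>>              static String createRecord(ZooKeeper zk, String prefix)
>>                      throws KeeperException, InterruptedException {
>>                  while (true) {
>>                      try {
>>                          return zk.create(prefix, new byte[0],
>>                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
>>                                  CreateMode.PERSISTENT_SEQUENTIAL);
>>                      } catch (KeeperException.NodeExistsException e) {
>>                          // "can't happen" for a sequential create, so
>>                          // we simply retry; this is why the same error
>>                          // repeats forever once the server is in this
>>                          // state
>>                      }
>>                  }
>>              }
>>          }
>>
>>      We call it with a prefix like "/zkrsm/00000000000000b2_record",
>>      which is what produced the path in the message above.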
>>
>>      I don't see anything terribly fishy in the transition between the
>>      epochs; the correct snapshots seem to be getting transferred,
>>      etc.  Unfortunately I don't have a ZK snapshot/log that exhibits
>>      the problem when starting with a fresh system.
>>
>>      Does anyone have any ideas?  Maybe it's related to one of the
>>      outstanding election JIRAs (like ZOOKEEPER-975)?  Thanks,
>>
>>      Jeremy
>>
>>      P.S. Some oddities you might notice in these logs:
>>      * Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes
>>      changed due to a bug in our application code.
>>      * We manage node membership dynamically, and our application
>>      restarts the ZooKeeperServer classes whenever a new node wants to
>>      join (without restarting the entire application process).  This is
>>      why you'll see messages like the following in node1.0x4-0x5.log
>>      before a new election begins:
>>
>>      2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
>>      org.apache.zookeeper.server.quorum.Learner  - shutdown called
>>
>>
>>
>>      On 03/01/2011 04:09 PM, Jeremy Stribling wrote:
>>      
>>>      My very shaky understanding from skimming those issues was that
>>>      in some cases there were two threads handling different types of
>>>      data related to the same transaction -- but maybe that's only
>>>      true when there's a leader and a follower.  But I also saw
>>>      something in there about restoring data from a snapshot vs.
>>>      restoring it from a log, which seems like it could have happened
>>>      in a single-node case.
>>>
>>>      In any case, now that 3.3.3 is out, I'll give it a try and
>>>      report back if we keep seeing this.
>>>
>>>      Thanks!
>>>
>>>      Jeremy
>>>
>>>      On 03/01/2011 02:59 PM, Vishal Kher wrote:
>>>        
>>>>      Hi Jeremy,
>>>>
>>>>      I just realized that you are using a standalone ZK server. I
>>>>      don't think the bugs apply to you, so I don't have an answer to
>>>>      your question.
>>>>      I think 3.3.3 should be released soon:
>>>>      http://zookeeper-dev.578911.n2.nabble.com/VOTE-Release-ZooKeeper-3-3-3-candidate-1-td6059109.html
>>>>
>>>>      -Vishal
>>>>
>>>>      On Tue, Mar 1, 2011 at 4:15 PM, Jeremy Stribling
>>>>      <strib@nicira.com>  wrote:
>>>>
>>>>          Thanks for the pointers, Vishal; I hadn't seen those.  They
>>>>          look like they could be related, but without knowing how
>>>>          metadata updates are grouped into transactions, it's hard
>>>>          for me to say.  I would expect the cversion update to happen
>>>>          within the same transaction as the creation of a new child,
>>>>          but if they get written to the log in two separate steps,
>>>>          perhaps these issues could explain it.
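>>>>
>>>>          One thing I can do in the meantime is dump the transaction
>>>>          log with the LogFormatter class that ships with the server
>>>>          and look at how the creates are recorded, e.g. something
>>>>          like (the jar names and log file name here are just from my
>>>>          setup):
>>>>
>>>>              java -cp zookeeper-3.3.2.jar:lib/log4j-1.2.15.jar \
>>>>                  org.apache.zookeeper.server.LogFormatter \
>>>>                  version-2/log.400000001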
>>>>
>>>>          Any estimate on when 3.3.3 will be released?  I haven't seen
>>>>          any updates on the user list about it.  Thanks,
>>>>
>>>>          Jeremy
>>>>
>>>>
>>>>          On 03/01/2011 12:40 PM, Vishal Kher wrote:
>>>>
>>>>              Hi Jeremy,
>>>>
>>>>              One of the main reasons for the 3.3.3 release was to
>>>>              include fixes for znode inconsistency bugs.
>>>>              Have you taken a look at
>>>>              https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
>>>>              https://issues.apache.org/jira/browse/ZOOKEEPER-919?
>>>>              The problem that you are seeing sounds similar to the
>>>>              ones reported.
>>>>
>>>>              -Vishal
>>>>
>>>>
>>>>
>>>>              On Mon, Feb 28, 2011 at 8:04 PM, Jeremy
>>>>              Stribling <strib@nicira.com> wrote:
>>>>
>>>>
>>>>                  Hi all,
>>>>
>>>>                  A while back I noticed that my Zookeeper cluster got
>>>>                  into a state where I
>>>>                  would get a "node exists" error back when creating a
>>>>                  sequential znode -- see
>>>>                  the thread starting at
>>>>                  http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%3C4CCA1E2F.9020606@nicira.com%3E
>>>>                  for more details.  The summary is that at the time, my
>>>>                  application had a bug
>>>>                  that could have been improperly bringing new nodes
>>>>                  into a cluster.
>>>>
>>>>                  However, I've seen this a couple more times since
>>>>                  fixing that original bug.
>>>>                   I don't yet know how to reproduce it, but I am
>>>>                  going to keep trying.  In
>>>>                  one case, we restarted a node (in a one-node
>>>>                  cluster), and when it came back
>>>>                  up we could no longer create sequential nodes on a
>>>>                  certain parent node, getting
>>>>                  a node exists (-110) error code.  The biggest child
>>>>                  it saw on restart was
>>>>                  /zkrsm/000000000000002d_record0000120804 (i.e., a
>>>>                  sequence number of
>>>>                  120804); however, a stat on the parent node revealed
>>>>                  that the cversion was
>>>>                  only 120710:
>>>>
>>>>                  [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
>>>>                  cZxid = 0x5
>>>>                  ctime = Mon Jan 17 18:28:19 PST 2011
>>>>                  mZxid = 0x5
>>>>                  mtime = Mon Jan 17 18:28:19 PST 2011
>>>>                  pZxid = 0x1d819
>>>>                  cversion = 120710
>>>>                  dataVersion = 0
>>>>                  aclVersion = 0
>>>>                  ephemeralOwner = 0x0
>>>>                  dataLength = 0
>>>>                  numChildren = 2955
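>>>>
>>>>                  Roughly, the check that exposed this amounts to the
>>>>                  following sketch (it assumes our naming scheme,
>>>>                  where every child name ends in a 10-digit sequence
>>>>                  number):
>>>>
>>>>                      import java.util.List;
>>>>                      import org.apache.zookeeper.ZooKeeper;
>>>>                      import org.apache.zookeeper.data.Stat;
>>>>
>>>>                      class ParentCheck {
>>>>                          static void check(ZooKeeper zk) throws Exception {
>>>>                              Stat stat = new Stat();
>>>>                              List<String> kids =
>>>>                                      zk.getChildren("/zkrsm", false, stat);
>>>>                              int maxSeq = -1;
>>>>                              for (String kid : kids) {
>>>>                                  // the last 10 characters are the
>>>>                                  // sequence number ZooKeeper assigned
>>>>                                  int seq = Integer.parseInt(
>>>>                                          kid.substring(kid.length() - 10));
>>>>                                  maxSeq = Math.max(maxSeq, seq);
>>>>                              }
>>>>                              // a child's sequence number should never
>>>>                              // exceed the parent's cversion
>>>>                              System.out.println("max child seq = " + maxSeq
>>>>                                      + ", cversion = " + stat.getCversion());
>>>>                          }
>>>>                      }
>>>>
>>>>                  In the state above, that prints a max child sequence
>>>>                  of 120804 against a cversion of 120710.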
>>>>
>>>>                  So my question is: how is znode metadata persisted
>>>>                  with respect to the
>>>>                  actual znodes?  Is it possible that a node's
>>>>                  children will get synced to
>>>>                  disk before its own metadata, and if it crashes at a
>>>>                  bad time, the metadata
>>>>                  updates will be lost?  If so, is there any way to
>>>>                  constrain Zookeeper so
>>>>                  that it will sync its metadata before returning
>>>>                  success for write
>>>>                  operations?
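>>>>
>>>>                  For what it's worth, one thing I still want to rule
>>>>                  out on our side is that we're somehow disabling the
>>>>                  transaction log fsync.  My understanding (and I may
>>>>                  be wrong) is that the zookeeper.forceSync system
>>>>                  property controls whether the log is synced before a
>>>>                  write is acknowledged, and that it defaults to "yes".
>>>>                  Since ZooKeeper runs inside our application process,
>>>>                  a quick check is just:
>>>>
>>>>                      // prints "yes" unless something overrode it
>>>>                      // with -Dzookeeper.forceSync=no
>>>>                      System.out.println("zookeeper.forceSync = "
>>>>                              + System.getProperty("zookeeper.forceSync",
>>>>                                      "yes"));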
>>>>
>>>>                  (I'm using Zookeeper 3.3.2 on a Debian Squeeze
>>>>                  64-bit box, with
>>>>                  openjdk-6-jre 6b18-1.8.3-2.)
>>>>
>>>>                  I'd be happy to create a JIRA for this if that seems
>>>>                  useful, but without a
>>>>                  way to reproduce it I'm not sure that it is.
>>>>
>>>>                  Thanks,
>>>>
>>>>                  Jeremy
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>          
>>      
