From: Jeremy Stribling
Date: Tue, 01 Mar 2011 13:15:08 -0800
To: user@zookeeper.apache.org
CC: Vishal Kher
Subject: Re: znode metadata consistency
Thanks for the pointers, Vishal; I hadn't seen those. They look like they could be related, but without knowing how metadata updates are grouped into transactions, it's hard for me to say. I would expect the cversion update to happen within the same transaction as the creation of a new child, but if the two are written to the log in separate steps, perhaps these issues explain it.

Any estimate on when 3.3.3 will be released? I haven't seen any updates about it on the user list.

Thanks,
Jeremy

On 03/01/2011 12:40 PM, Vishal Kher wrote:
> Hi Jeremy,
>
> One of the main reasons for the 3.3.3 release was to include fixes for znode
> inconsistency bugs.
> Have you taken a look at https://issues.apache.org/jira/browse/ZOOKEEPER-962 and
> https://issues.apache.org/jira/browse/ZOOKEEPER-919?
> The problem that you are seeing sounds similar to the ones reported.
>
> -Vishal
>
> On Mon, Feb 28, 2011 at 8:04 PM, Jeremy Stribling wrote:
>
>> Hi all,
>>
>> A while back I noticed that my Zookeeper cluster got into a state where I
>> would get a "node exists" error back when creating a sequential znode -- see
>> the thread starting at
>> http://mail-archives.apache.org/mod_mbox/hadoop-zookeeper-user/201010.mbox/%3C4CCA1E2F.9020606@nicira.com%3E
>> for more details. The summary is that at the time, my application had a bug
>> that could have been improperly bringing new nodes into a cluster.
>>
>> However, I've seen this a couple more times since fixing that original bug.
>> I don't yet know how to reproduce it, but I am going to keep trying. In
>> one case, we restarted a node (in a one-node cluster), and when it came back
>> up we could no longer create sequential nodes on a certain parent node, with
>> a node exists (-110) error code.
>> The biggest child it saw on restart was
>> /zkrsm/000000000000002d_record0000120804 (i.e., a sequence number of
>> 120804); however, a stat on the parent node revealed that the cversion was
>> only 120710:
>>
>> [zk:(CONNECTED) 3] stat /zkrsm
>> cZxid = 0x5
>> ctime = Mon Jan 17 18:28:19 PST 2011
>> mZxid = 0x5
>> mtime = Mon Jan 17 18:28:19 PST 2011
>> pZxid = 0x1d819
>> cversion = 120710
>> dataVersion = 0
>> aclVersion = 0
>> ephemeralOwner = 0x0
>> dataLength = 0
>> numChildren = 2955
>>
>> So my question is: how is znode metadata persisted with respect to the
>> actual znodes? Is it possible that a node's children will get synced to
>> disk before its own metadata, and that if the server crashes at a bad
>> time, the metadata updates will be lost? If so, is there any way to
>> constrain Zookeeper so that it will sync its metadata before returning
>> success for write operations?
>>
>> (I'm using Zookeeper 3.3.2 on a Debian Squeeze 64-bit box, with
>> openjdk-6-jre 6b18-1.8.3-2.)
>>
>> I'd be happy to create a JIRA for this if that seems useful, but without a
>> way to reproduce it I'm not sure that it is.
>>
>> Thanks,
>>
>> Jeremy
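Why a lagging cversion turns into a -110 error can be modeled in a few lines. As far as I understand the 3.3 server, the suffix of a sequential znode is the parent's current cversion, zero-padded to ten digits (which matches "record0000120804" above), and cversion is bumped on every child create and delete. If that holds, a parent with cversion 120710 whose children already reach 120804 will hand out a name that already exists. The sketch below is a self-contained model of that reasoning under those assumptions, not ZooKeeper's own code; the class SequentialNameCheck, its methods, and the child set in main() are hypothetical, illustrative data rather than the real /zkrsm listing.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

/**
 * Hypothetical model (not ZooKeeper code) of sequential znode naming,
 * assuming the suffix is the parent's cversion as a 10-digit decimal.
 */
public class SequentialNameCheck {

    // Assumed naming rule: prefix + parent cversion, zero-padded to 10 digits.
    static String sequentialName(String prefix, int parentCversion) {
        return prefix + String.format(Locale.ENGLISH, "%010d", parentCversion);
    }

    // Largest 10-digit numeric suffix among child names that start with
    // namePrefix (child names as returned by getChildren, not full paths);
    // returns -1 if no child matches.
    static long maxSuffix(Set<String> children, String namePrefix) {
        long max = -1;
        for (String child : children) {
            if (!child.startsWith(namePrefix)
                    || child.length() != namePrefix.length() + 10) {
                continue;
            }
            String suffix = child.substring(namePrefix.length());
            if (!suffix.chars().allMatch(Character::isDigit)) {
                continue; // same length by coincidence, not a sequence suffix
            }
            max = Math.max(max, Long.parseLong(suffix));
        }
        return max;
    }

    // True if the name the next sequential create would receive is already taken.
    static boolean wouldCollide(Set<String> children, String namePrefix, int parentCversion) {
        return children.contains(sequentialName(namePrefix, parentCversion));
    }

    public static void main(String[] args) {
        String prefix = "000000000000002d_record";
        Set<String> children = new TreeSet<>();
        // Illustrative data only: children numbered 120700..120804.
        for (int seq = 120700; seq <= 120804; seq++) {
            children.add(sequentialName(prefix, seq));
        }
        int cversion = 120710; // the value from the stat output above
        System.out.println("next name:  " + sequentialName(prefix, cversion));
        System.out.println("max suffix: " + maxSuffix(children, prefix));
        System.out.println("collides:   " + wouldCollide(children, prefix, cversion));
    }
}
```

Against a live server, the same comparison could be fed from ZooKeeper.getChildren(path, false, stat) and stat.getCversion(); under the assumptions above, a healthy parent should always report a cversion strictly greater than the largest suffix among its sequential children.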