From: Todd Lipcon
Date: Thu, 4 Aug 2011 21:01:29 -0700
Subject: Re: Apparent data loss on 90.4 rc2 after partial zookeeper network partition (on MapR)
To: user@hbase.apache.org

On Thu, Aug 4, 2011 at 8:36 PM, lohit wrote:
> 2011/8/4 Ryan Rawson
>
>> Yes, that is what JD is referring to, the so-called IO fence.
>>
>> It works like so:
>> - regionserver is appending to an HLog, continues to do so, hasn't
>> gotten the ZK "kill yourself" signal yet
>> - hmaster splits the logs
>> - the hmaster yanks the writer from under the regionserver, and the RS
>> then starts to kill itself
>>
> Can you tell more about how this is done with HDFS. If the RS has the
> lease, how did the master get hold of that lease? Or is it removing the
> file?

In older versions, it would call append(), which recovered the lease so
long as the soft lease timeout had expired. More recently, it calls an
HDFS "recoverLease" API that provides fencing.
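Roughly, that call looks like the sketch below (a minimal illustration
against the plain HDFS client API, not the actual HBase code; the class
name and polling loop are made up for the example, but
DistributedFileSystem.recoverLease() is the call in question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class HLogFencing {
      // Ask the NameNode to revoke the dead RS's lease and close the
      // HLog. Once recoverLease() returns true the file's length is
      // finalized, so the master can split it without racing the old
      // writer. (The exact signature varies across Hadoop versions;
      // this assumes the boolean-returning form.)
      static void fenceAndWait(Path hlog, Configuration conf)
          throws Exception {
        FileSystem fs = hlog.getFileSystem(conf);
        if (!(fs instanceof DistributedFileSystem)) {
          return; // non-HDFS filesystems need their own fencing story
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;
        // Block recovery is asynchronous, so poll until the NameNode
        // reports the file closed.
        while (!dfs.recoverLease(hlog)) {
          Thread.sleep(1000);
        }
      }
    }

After that, the old writer's next append/sync fails with a lease
exception, which is the half of the fence that stops a zombie RS from
adding edits behind the split.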
>>
>> This can happen because ZK can deliver the session lost message late,
>> and there is a race.
>>
>> -ryan
>>
>> On Thu, Aug 4, 2011 at 8:13 PM, M. C. Srivas wrote:
>> > On Thu, Aug 4, 2011 at 10:34 AM, Jean-Daniel Cryans wrote:
>> >
>> >> > Thanks for the feedback. So you're inclined to think it would be
>> >> > at the dfs layer?
>> >>
>> >> That's where the evidence seems to point.
>> >>
>> >> > Is it accurate to say the most likely places where the data could
>> >> > have been lost were:
>> >> > 1. wal writes didn't actually get written to disk (no log entries
>> >> > to suggest any issues)
>> >>
>> >> Most likely.
>> >>
>> >> > 2. wal corrupted (no log entries suggest any trouble reading the
>> >> > log)
>> >>
>> >> In that case the logs would scream (and I didn't see that in the
>> >> logs I looked at).
>> >>
>> >> > 3. not all split logs were read by regionservers (?? is there any
>> >> > way to ensure this either way... should I look at the filesystem
>> >> > some place?)
>> >>
>> >> Some regions would have recovered edits files, but that seems highly
>> >> unlikely. With DEBUG enabled we could have seen which files were
>> >> split by the master and which ones were created for the regions, and
>> >> then which were read by the region servers.
>> >>
>> >> > Do you think the type of network partition I'm talking about is
>> >> > adequately covered in existing tests? (Specifically running an
>> >> > external zk cluster?)
>> >>
>> >> The IO fencing was only tested with HDFS, I don't know what happens
>> >> in that case with MapR. What I mean is that when the master splits
>> >> the logs, it takes ownership of the HDFS writer lease (only one per
>> >> file) so that it can safely close the log file. Then after that it
>> >> checks if there are any new log files that were created (the region
>> >> server could have rolled a log while the master was splitting them)
>> >> and will restart if that situation happens, until it's able to own
>> >> all files and split them.
>> >
>> > JD, I didn't think the master explicitly dealt with writer leases.
>> >
>> > Does HBase rely on single-writer semantics on the log file? That is,
>> > if the master and a RS both decide to mucky-muck with a log file, you
>> > expect the FS to lock out one of the writers?
>> >
>> >> > Have you heard if anyone else has been having problems with the
>> >> > second 90.4 rc?
>> >>
>> >> Nope, we run it here on our dev cluster and didn't encounter any
>> >> issue (with the code or node failure).
>> >>
>> >> > Thanks again for your help. I'm following up with the MapR guys as
>> >> > well.
>> >>
>> >> Good idea!
>> >>
>> >> J-D
>> >
>
> --
> Have a Nice Day!
> Lohit

--
Todd Lipcon
Software Engineer, Cloudera