From: stack
To: hbase-dev@hadoop.apache.org
Date: Fri, 7 Aug 2009 09:54:53 -0700
Subject: append (hadoop-4379), was -> Re: roadmap: data integrity

Here is a quick note on the current state of my testing of HADOOP-4379 (support for 'append' in hadoop 0.20.x). On my small test cluster, I am not able to break the latest patch posted by Dhruba under heavy loading. It seems to basically work. On regionserver crash, the master runs log split, and when it comes to the last of the set of regionserver logs to split -- the one that is inevitably unclosed because the process crashed -- we are able to recover most edits in this last file (in my testing, it seemed to be all edits up to the last flush by the regionserver process). The upshot is that, tentatively, we may have a "working" append in the 0.20 timeframe (in 0.21, we should have https://issues.apache.org/jira/browse/HDFS-265).
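For the curious, the shape of that recovery is roughly the sketch below. This is illustrative only -- the path is made up and it is not the actual HLog code -- but the append-then-close step is the crux: closing the file recovers the lease, after which readers can see edits up to the last flush.

    // Sketch: recover an unclosed regionserver log, then read it back.
    // Assumes an append-capable hadoop 0.20 (HADOOP-4379 applied) with
    // dfs.support.append=true. Path and class name are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class RecoverLogSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path log = new Path("/hbase/log_x/unclosed.log"); // hypothetical

        // Open for append, then close immediately. The close forces lease
        // recovery; afterwards new readers see up to the last flush.
        fs.append(log).close();

        // The usual SequenceFile read now works on the recovered file.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, log, conf);
        try {
          // ... iterate edits with reader.next(key, value) and replay ...
        } finally {
          reader.close();
        }
      }
    }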
I'll keep testing, but I'd suggest it's time for others to try it out.

With HADOOP-4379, the process recovering non-closed log files -- the master in our case -- must successfully open the file in append mode and then close it. Once closed, new readers can purportedly see up to the last flush. The open for append can take a little while before it goes through (the complaint is that another process holds the file's lease). Meantime, the process opening for append must retry. In my experience it takes 2-10 seconds.

Support for appends is off by default in hadoop even after HADOOP-4379 has been applied. To enable it, you need to set dfs.support.append. Set it everywhere -- all over hadoop and in hbase-site.xml so hbase/DFSClient can see the attribute. HBase TRUNK will recognize whether the bundled hadoop supports append via introspection (SequenceFile has a new syncFs method when HADOOP-4379 has been applied). If an append-supporting hadoop is present, and dfs.support.append is set in the hbase context, then hbase, when it runs HLog#splitLog, will try opening files for append. On regionserver crash, you can see the master's HLog#splitLog loop retrying the open for append until it is successful (you'll see in the master log the complaint that the lease on the file is held by another process). We retry every second.

Successful recovery of all edits is uncovering new, interesting issues. In my testing I was not only killing the regionserver but also killing the regionserver and datanode together. In the latter case, what I would see is that the namenode would continue to assign the dead datanode work, at least until its lease expired. Fair enough, says you, only the datanode lease is ten minutes by default. I set it down in my tests using heartbeat.recheck.interval (there is a pregnant comment in HADOOP-4379 with client-side code where Ruyue Ma says they get around this issue by having the client pass the namenode the datanodes it knows are dead when asking for an extra block). We might want to recommend setting it down in general. Other issues are hbase bugs we see once all edits are recovered. I've been filing issues on these over the last few days. Rough sketches of the config, the introspection check, and the retry loop follow below.
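For reference, the flag goes into hadoop's hdfs-site.xml and into hbase-site.xml as something like the below. The heartbeat.recheck.interval entry is the knob mentioned above for noticing dead datanodes sooner; 15000 is just an example value (the interval is in milliseconds in stock hadoop), not a recommendation.

    <!-- hdfs-site.xml and hbase-site.xml: turn appends on everywhere -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

    <!-- Example only: have the namenode notice dead datanodes sooner.
         Value is in milliseconds. -->
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>15000</value>
    </property>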
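The introspection amounts to a reflection probe, something like this sketch. Going by the patch, the method lands on SequenceFile.Writer; treat the exact location as an assumption -- the real check lives in hbase TRUNK.

    import org.apache.hadoop.io.SequenceFile;

    public class AppendProbeSketch {
      /** Does the bundled hadoop's SequenceFile.Writer carry the syncFs()
       *  method that HADOOP-4379 adds? If so, append support is compiled
       *  in (it still has to be switched on via dfs.support.append). */
      public static boolean isAppendSupported() {
        try {
          SequenceFile.Writer.class.getMethod("syncFs", new Class<?>[] {});
          return true;
        } catch (NoSuchMethodException e) {
          return false; // vanilla 0.20: no append support
        }
      }
    }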
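And the retry dance in HLog#splitLog boils down to something like this (not the actual code; the shape is what matters -- try the append, catch the lease complaint, sleep a second, go again):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LeaseRetrySketch {
      /** Keep trying to open the unclosed log for append until the
       *  namenode releases the dead writer's lease; close at once to
       *  finish recovery. In my testing this takes 2-10 seconds. */
      static void recoverLease(FileSystem fs, Path log)
          throws InterruptedException {
        while (true) {
          try {
            fs.append(log).close();
            return; // lease recovered; the file is now readable
          } catch (IOException e) {
            // Typically a complaint that another process holds the
            // file's lease; retry every second, as the master does.
            Thread.sleep(1000);
          }
        }
      }
    }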
St.Ack

On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell wrote:
> Good to see there's direct edit replication support; that can make
> things easier.
>
> I've seen people use DRBD or NFS to replicate edits currently.
>
> Namenode failover is a "solvable" issue with traditional HA: OS level
> heartbeats, fencing, failover -- e.g. the HA infrastructure daemon starts
> an NN instance on node B if the heartbeat from node A is lost and takes a
> power control operation on A to make sure it is dead. On both nodes the
> infrastructure daemons trigger the OS watchdog if the NN process dies.
> Combine this with automatic IP address reassignment. Then, page the
> operators. Add another node C for additional redundancy, and make sure
> all of the alternatives are on separate racks and power rails, and make
> sure the L2 and L3 topology is also HA (e.g. bonded ethernet to
> redundant switches at L2, mesh routing at L3, etc.). If the cluster is
> not super huge it can all be spanned at L2 over redundant switches. L3
> redundancy is trickier. A typical configuration could have a lot of OSPF
> stub networks -- depends how L2 is partitioned -- which can make the
> routing table difficult for operators to sort out.
>
> I've seen this type of thing work for myself: ~15 seconds from a
> (simulated) fault on NN node A to the new NN up and responding to DN
> reconnections on node B, with 0.19.
>
> You can build in additional assurance of fast failover by building
> redundant processes to run concurrently with a few datanodes which over
> and over ping the NN via the namenode protocol and trigger fencing and
> failover if it stops responding.
>
> One wrinkle is that the new namenode starts up in safe mode. As long as
> HBase can handle temporary periods where the cluster goes into
> safe mode after NN failover, it can ride it out.
>
> This is ugly, but it is, I believe, an accepted and valid systems
> engineering solution for the NN SPOF issue for the folks I mentioned
> in my previous email, something they would be familiar with. Edit
> replication support in HDFS 0.21 makes it a little less work to
> achieve and maybe a little faster to execute, so that's an
> improvement.
>
> It may be overstating it a little to say that the NN SPOF is not a
> concern for HBase, but, in my opinion, we need to address the WAL and
> (lack of FSCK) issues first before being concerned about it. HBase can
> lose data all on its own.
>
> - Andy
>
>
> ________________________________
> From: Jean-Daniel Cryans
> To: hbase-dev@hadoop.apache.org
> Sent: Friday, August 7, 2009 3:25:19 AM
> Subject: Re: roadmap: data integrity
>
> https://issues.apache.org/jira/browse/HADOOP-4539
>
> This issue was closed long ago. But Steve Loughran just said on the
> hadoop mailing list that the new NN has to come up with the same
> IP/hostname as the failed one.
>
> J-D
>
> On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson wrote:
> > The WAL is a major issue, but another one that is coming up fast is
> > the SPOF that is the namenode.
> >
> > Right now, namenode aside, I can rolling-restart my entire cluster,
> > including rebooting the machines if I needed to. But not so with the
> > namenode, because if it goes AWOL, all sorts of bad can happen.
> >
> > I hope that HDFS 0.21 addresses both these issues. Can we get
> > positive confirmation that this is being worked on?
> >
> > -ryan
> >
> > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell wrote:
> >> I updated the roadmap up on the wiki:
> >>
> >> * Data integrity
> >>   * Ensure that proper append() support in HDFS actually closes the
> >>     WAL last-block write hole
> >>   * HBase-FSCK (HBASE-7) -- suggest making this a blocker for 0.21
> >>
> >> I have had several recent conversations on my travels with people in
> >> Fortune 100 companies (based on this list:
> >> http://www.wageproject.org/content/fortune/index.php).
> >>
> >> You and I know we can set up well-engineered HBase 0.20 clusters that
> >> will be operationally solid for a wide range of use cases, but given
> >> those aforementioned discussions there are certain sectors which would
> >> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
> >>
> >> - Yes, when the client sees data has been committed, it actually has
> >>   been written and replicated on spinning or solid state media in all
> >>   cases.
> >>
> >> - Yes, we go to great lengths to recover data if, ${deity} forbid, you
> >>   crush some underprovisioned cluster with load or some bizarre bug or
> >>   system fault happens.
> >>
> >> HBASE-1295 is also required for business continuity reasons, but this
> >> is already a priority item for some HBase committers.
> >>
> >> The question, I think, is whether the above aligns with project goals.
> >> Making HBase-FSCK a blocker will probably knock something someone
> >> wants for the 0.21 timeframe off the list.
> >>
> >> - Andy
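Regarding the safe-mode wrinkle Andrew raises above: riding out a failover is a matter of polling the namenode until it leaves safe mode, roughly as in this sketch (assuming the 0.20-era DistributedFileSystem API; illustrative only, not hbase code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    public class SafeModeWaitSketch {
      /** Block until the (possibly just failed-over) namenode leaves
       *  safe mode, polling once a second. */
      public static void waitOutSafeMode(Configuration conf)
          throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);
        while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
          Thread.sleep(1000); // still in safe mode; keep waiting
        }
      }
    }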