From: stack
To: hbase-dev@hadoop.apache.org
Date: Fri, 7 Aug 2009 09:54:53 -0700
Subject: append (hadoop-4379), was -> Re: roadmap: data integrity

Here is a quick note on the current state of my testing of HADOOP-4379 (support for 'append' in hadoop 0.20.x). On my small test cluster, I am not able to break the latest patch posted by Dhruba under heavy loading. It seems to basically work. On regionserver crash, the master runs log split, and when it comes to the last of the set of regionserver logs to split -- the one that is inevitably unclosed because the process crashed -- we are able to recover most edits in this last file (in my testing, it seemed to be all edits up to the last flush by the regionserver process). The upshot is that, tentatively, we may have a "working" append in the 0.20 timeframe (in 0.21, we should have https://issues.apache.org/jira/browse/HDFS-265).
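For the curious, the shape of that recovery is roughly the sketch below. This is illustrative only -- the path is made up and it is not the actual HLog code -- but the append-then-close step is the crux: closing the file recovers the lease, after which readers can see edits up to the last flush.

    // Sketch: recover an unclosed regionserver log, then read it back.
    // Assumes an append-capable hadoop 0.20 (HADOOP-4379 applied) with
    // dfs.support.append=true. Path and class name are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;

    public class RecoverLogSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path log = new Path("/hbase/log_x/unclosed.log"); // hypothetical

        // Open for append, then close immediately. The close forces lease
        // recovery; afterwards new readers see up to the last flush.
        fs.append(log).close();

        // The usual SequenceFile read now works on the recovered file.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, log, conf);
        try {
          // ... iterate edits with reader.next(key, value) and replay ...
        } finally {
          reader.close();
        }
      }
    }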
I'll keep testing, but I'd suggest it's time for others to try it out.

With HADOOP-4379, the process recovering non-closed log files -- the master in our case -- must successfully open the file in append mode and then close it. Once closed, new readers can purportedly see up to the last flush. The open for append can take a little while before it goes through (the complaint is that another process holds the file's lease). Meantime, the process opening for append must retry. In my experience it takes 2-10 seconds.

Support for appends is off by default in hadoop even after HADOOP-4379 has been applied. To enable it, you need to set dfs.support.append. Set it everywhere -- all over hadoop and in hbase-site.xml so hbase/DFSClient can see the attribute. HBase TRUNK will recognize whether the bundled hadoop supports append via introspection (SequenceFile has a new syncFs method when HADOOP-4379 has been applied). If an append-supporting hadoop is present, and dfs.support.append is set in the hbase context, then hbase, when it runs HLog#splitLog, will try opening files for append. On regionserver crash, you can see the master's HLog#splitLog loop retrying the open for append until it is successful (you'll see in the master log the complaint that the lease on the file is held by another process). We retry every second.

Successful recovery of all edits is uncovering new, interesting issues. In my testing I was not only killing the regionserver but also killing the regionserver and datanode together. In the latter case, what I would see is that the namenode would continue to assign the dead datanode work, at least until its lease expired. Fair enough, says you, only the datanode lease is ten minutes by default. I set it down in my tests using heartbeat.recheck.interval (there is a pregnant comment in HADOOP-4379 with client-side code where Ruyue Ma says they get around this issue by having the client pass the namenode the datanodes it knows are dead when asking for an extra block). We might want to recommend setting it down in general. Other issues are hbase bugs we see once all edits are recovered. I've been filing issues on these over the last few days. Rough sketches of the config, the introspection check, and the retry loop follow below.
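For reference, the flag goes into hadoop's hdfs-site.xml and into hbase-site.xml as something like the below. The heartbeat.recheck.interval entry is the knob mentioned above for noticing dead datanodes sooner; 15000 is just an example value (the interval is in milliseconds in stock hadoop), not a recommendation.

    <!-- hdfs-site.xml and hbase-site.xml: turn appends on everywhere -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>

    <!-- Example only: have the namenode notice dead datanodes sooner.
         Value is in milliseconds. -->
    <property>
      <name>heartbeat.recheck.interval</name>
      <value>15000</value>
    </property>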
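The introspection amounts to a reflection probe, something like this sketch. Going by the patch, the method lands on SequenceFile.Writer; treat the exact location as an assumption -- the real check lives in hbase TRUNK.

    import org.apache.hadoop.io.SequenceFile;

    public class AppendProbeSketch {
      /** Does the bundled hadoop's SequenceFile.Writer carry the syncFs()
       *  method that HADOOP-4379 adds? If so, append support is compiled
       *  in (it still has to be switched on via dfs.support.append). */
      public static boolean isAppendSupported() {
        try {
          SequenceFile.Writer.class.getMethod("syncFs", new Class<?>[] {});
          return true;
        } catch (NoSuchMethodException e) {
          return false; // vanilla 0.20: no append support
        }
      }
    }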
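And the retry dance in HLog#splitLog boils down to something like this (not the actual code; the shape is what matters -- try the append, catch the lease complaint, sleep a second, go again):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LeaseRetrySketch {
      /** Keep trying to open the unclosed log for append until the
       *  namenode releases the dead writer's lease; close at once to
       *  finish recovery. In my testing this takes 2-10 seconds. */
      static void recoverLease(FileSystem fs, Path log)
          throws InterruptedException {
        while (true) {
          try {
            fs.append(log).close();
            return; // lease recovered; the file is now readable
          } catch (IOException e) {
            // Typically a complaint that another process holds the
            // file's lease; retry every second, as the master does.
            Thread.sleep(1000);
          }
        }
      }
    }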
St.Ack

On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell wrote:
> Good to see there's direct edit replication support; that can make
> things easier.
>
> I've seen people use DRBD or NFS to replicate edits currently.
>
> Namenode failover is a "solvable" issue with traditional HA: OS level
> heartbeats, fencing, failover -- e.g. the HA infrastructure daemon starts
> an NN instance on node B if the heartbeat from node A is lost and takes a
> power control operation on A to make sure it is dead. On both nodes the
> infrastructure daemons trigger the OS watchdog if the NN process dies.
> Combine this with automatic IP address reassignment. Then, page the
> operators. Add another node C for additional redundancy, and make sure
> all of the alternatives are on separate racks and power rails, and make
> sure the L2 and L3 topology is also HA (e.g. bonded ethernet to
> redundant switches at L2, mesh routing at L3, etc.). If the cluster is
> not super huge it can all be spanned at L2 over redundant switches. L3
> redundancy is trickier. A typical configuration could have a lot of OSPF
> stub networks -- depends how L2 is partitioned -- which can make the
> routing table difficult for operators to sort out.
>
> I've seen this type of thing work for myself: ~15 seconds from a
> (simulated) fault on NN node A to the new NN up and responding to DN
> reconnections on node B, with 0.19.
>
> You can build in additional assurance of fast failover by building
> redundant processes to run concurrently with a few datanodes which over
> and over ping the NN via the namenode protocol and trigger fencing and
> failover if it stops responding.
>
> One wrinkle is that the new namenode starts up in safe mode. As long as
> HBase can handle temporary periods where the cluster goes into
> safe mode after NN failover, it can ride it out.
>
> This is ugly, but it is, I believe, an accepted and valid systems
> engineering solution for the NN SPOF issue for the folks I mentioned
> in my previous email, something they would be familiar with. Edit
> replication support in HDFS 0.21 makes it a little less work to
> achieve and maybe a little faster to execute, so that's an
> improvement.
>
> It may be overstating it a little to say that the NN SPOF is not a
> concern for HBase, but, in my opinion, we need to address the WAL and
> (lack of FSCK) issues first before being concerned about it. HBase can
> lose data all on its own.
>
> - Andy
>
>
> ________________________________
> From: Jean-Daniel Cryans
> To: hbase-dev@hadoop.apache.org
> Sent: Friday, August 7, 2009 3:25:19 AM
> Subject: Re: roadmap: data integrity
>
> https://issues.apache.org/jira/browse/HADOOP-4539
>
> This issue was closed long ago. But Steve Loughran just said on the
> hadoop mailing list that the new NN has to come up with the same
> IP/hostname as the failed one.
>
> J-D
>
> On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson wrote:
> > The WAL is a major issue, but another one that is coming up fast is
> > the SPOF that is the namenode.
> >
> > Right now, namenode aside, I can rolling-restart my entire cluster,
> > including rebooting the machines if I needed to. But not so with the
> > namenode, because if it goes AWOL, all sorts of bad can happen.
> >
> > I hope that HDFS 0.21 addresses both these issues. Can we get
> > positive confirmation that this is being worked on?
> >
> > -ryan
> >
> > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell wrote:
> >> I updated the roadmap up on the wiki:
> >>
> >> * Data integrity
> >>   * Ensure that proper append() support in HDFS actually closes the
> >>     WAL last-block write hole
> >>   * HBase-FSCK (HBASE-7) -- suggest making this a blocker for 0.21
> >>
> >> I have had several recent conversations on my travels with people in
> >> Fortune 100 companies (based on this list:
> >> http://www.wageproject.org/content/fortune/index.php).
> >>
> >> You and I know we can set up well-engineered HBase 0.20 clusters that
> >> will be operationally solid for a wide range of use cases, but given
> >> those aforementioned discussions there are certain sectors which would
> >> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
> >>
> >> - Yes, when the client sees data has been committed, it actually has
> >>   been written and replicated on spinning or solid state media in all
> >>   cases.
> >>
> >> - Yes, we go to great lengths to recover data if, ${deity} forbid, you
> >>   crush some underprovisioned cluster with load or some bizarre bug or
> >>   system fault happens.
> >>
> >> HBASE-1295 is also required for business continuity reasons, but this
> >> is already a priority item for some HBase committers.
> >>
> >> The question, I think, is whether the above aligns with project goals.
> >> Making HBase-FSCK a blocker will probably knock something someone
> >> wants for the 0.21 timeframe off the list.
> >>
> >> - Andy
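Regarding the safe-mode wrinkle Andrew raises above: riding out a failover is a matter of polling the namenode until it leaves safe mode, roughly as in this sketch (assuming the 0.20-era DistributedFileSystem API; illustrative only, not hbase code):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.FSConstants;

    public class SafeModeWaitSketch {
      /** Block until the (possibly just failed-over) namenode leaves
       *  safe mode, polling once a second. */
      public static void waitOutSafeMode(Configuration conf)
          throws Exception {
        DistributedFileSystem dfs =
            (DistributedFileSystem) FileSystem.get(conf);
        while (dfs.setSafeMode(FSConstants.SafeModeAction.SAFEMODE_GET)) {
          Thread.sleep(1000); // still in safe mode; keep waiting
        }
      }
    }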