hbase-dev mailing list archives

From: Michael Drzal <mdr...@gmail.com>
Subject: Re: mttr update
Date: Sun, 02 Sep 2012 11:26:03 GMT
The other way to simulate a complete machine failure is to null route the
world.  As long as you have access to the console of the machine, you can
just set your default route to 127.0.0.1, and you will have effectively
stopped all communication.

On Tue, Aug 21, 2012 at 9:46 AM, N Keywal <nkeywal@gmail.com> wrote:

> Hi all,
>
> Just an update on the work in progress for the MTTR, looking at HDFS
> failures impact on HBase. There are some comments in HBASE-5843 for
> the pure HBase part.
>
> 1) Single node failure: a region server and its datanode.
> I think the main global issue is the one mentioned in HDFS-3703:
> currently HBase starts its recovery before HDFS identifies a node as
> dead. When we have a dead region server, its regions will be assigned
> to other region servers. In other words:
> - the HLog will be read to be split & replayed
> - new files will be written (i.e. new blocks allocated) as the result
> of this split
> - if the data locality of your cluster is good, you've just lost one
> third of the replicas of the hfiles for the regions you're migrating.
>
> By default, HBase recovery starts after 3 minutes (should be lower in
> production, likely 1 minute), while HDFS marks the datanode as dead
> after 10 minutes 30 seconds.
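>
> To make the gap concrete, here is the arithmetic as I understand it,
> with the default settings (my reading of the defaults, so check your
> own configuration):
>
>   public class DetectionDelays {
>     public static void main(String[] args) {
>       // HDFS marks a datanode dead after
>       //   2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
>       long hdfsDeadNodeMs = 2 * 300000L + 10 * 3000L;  // 630 000 ms = 10 min 30 s
>       // HBase declares a region server dead when its ZooKeeper session
>       // expires (zookeeper.session.timeout, 3 minutes by default)
>       long hbaseDetectionMs = 180000L;
>       System.out.println("window where HBase recovers against a datanode "
>           + "HDFS still trusts: "
>           + (hdfsDeadNodeMs - hbaseDetectionMs) / 60000.0 + " minutes");
>     }
>   }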
>
> But in the meantime, HDFS still returns the replicas on the dead
> datanode as valid, and proposes the dead datanode as a target for new
> writes.
> So:
> - you have delays and errors when splitting the hlog
> - reading the hfiles from the region servers receiving the regions
> will hit read errors as well (33% of the replicas are missing).
> - you will have write errors as well. For example, with one dead
> datanode out of 20 machines and a replication factor of 3, a given
> block will be directed to the dead datanode 3/20 = 15% of the time.
> With 100 machines it's just 3%, still per block. But with 100 blocks
> to write, you will hit at least one write error about 95% of the time
> (the arithmetic is sketched just after this list). The error will be
> recovered from, but it will slow the recovery down by another minute
> if you were creating new hlog files.
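>
> The arithmetic behind those percentages, as a small sketch (it assumes
> each block picks its 3 datanodes uniformly and independently, which is
> a simplification):
>
>   public class WriteErrorOdds {
>     public static void main(String[] args) {
>       int replication = 3;
>       // chance that a single block gets the dead datanode in its pipeline
>       double perBlock20  = (double) replication / 20;    // 0.15 -> 15%
>       double perBlock100 = (double) replication / 100;   // 0.03 -> 3%
>       // chance that at least one block out of 100 hits the dead datanode
>       double overHundredBlocks = 1 - Math.pow(1 - perBlock100, 100);  // ~0.95
>       System.out.printf("per block: %.0f%% (20 nodes), %.0f%% (100 nodes); "
>           + "over 100 blocks: %.0f%%%n",
>           perBlock20 * 100, perBlock100 * 100, overHundredBlocks * 100);
>     }
>   }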
>
> So the recovery time, as seen by the client, if everything goes well
> (i.e. we're reasonably lucky), will be around 4 minutes:
> 1 minute (hbase failure detection time)
> 1 minute (timeout when reading the hlog to split)
> 1 minute (timeout when writing the new hlogs)
> 1 minute (timeout when reading the old hfiles and getting read errors)
>
> It can be less if the datanodes were not already holding tcp
> connections with the dead nodes, as the connect timeout is 20s.
>
> The last 3 steps disappear with HDFS-3703. The partial workarounds are
> HBASE-6435 and HDFS-3705.
> Today, even with a single node failure, you can hit HDFS-3701 /
> HBASE-6401 as well.
>
> 2) Large failure (multiple nodes)
> 2.1) If we've lost all the replicas of a block, well, we're in bad
> shape. Not really studied yet, but some logs are documented in HBASE-6626.
> 2.2) For the HLog, we monitor the number of replicas and open a new
> file if replicas are missing. After 5 attempts, we continue with a
> single replica, the one on the same machine as the region server,
> which maximizes the risk (a rough sketch of this check follows the
> list). That's why I tend to think HDFS-3702 could be necessary; but
> from an HDFS point of view I haven't found a use case other than a
> write-ahead log. HDFS-3703 lowers the probability of being directed
> multiple times to bad datanodes.
> 2.3) Impact on memstore flush: not studied yet.
> 2.4) Impact of an HA namenode switch: I guess it's part of the
> namenode HA work. Not studied.
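>
> The low-replication check in 2.2 works roughly like this (a simplified
> sketch, not the actual HLog code; the names below are placeholders,
> not the real HBase identifiers):
>
>   public class LowReplicationRoller {
>     private final int expectedReplication = 3;  // what the hlog was created with
>     private final int maxRolls = 5;             // after that we accept the risk
>     private int rollsDone = 0;
>
>     void check(int currentReplicaCount) {
>       if (currentReplicaCount < expectedReplication) {
>         if (rollsDone < maxRolls) {
>           rollsDone++;
>           requestLogRoll();  // open a new hlog, hoping for healthier datanodes
>         }
>         // otherwise keep writing, possibly down to the single local replica
>       } else {
>         rollsDone = 0;       // replication is back to normal, reset the counter
>       }
>     }
>
>     void requestLogRoll() {
>       // in the real code this asks the log roller thread to roll the hlog
>     }
>   }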
>
> 3) Testing
> Hardware failures are difficult to simulate. When you kill -9 a
> process, the sockets are closed by the operating system, and the
> client is notified. It's not the same thing as a really dead machine,
> which won't reply to or send IP packets. The only "simple" way I found
> to test this is to unplug the network cable. I haven't tried with
> virtualization, and we don't have anything within the HBase
> minicluster to do a proper test (it would require specific hooks I
> think, if it is even possible). I don't know if HDFS has something for
> this.
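>
> One cheap way to see the difference from a client (just an
> illustration, not something we have in the test suite): connect to a
> port on the "failed" machine and watch how the failure shows up.
>
>   import java.io.IOException;
>   import java.net.InetSocketAddress;
>   import java.net.Socket;
>   import java.net.SocketTimeoutException;
>
>   public class DeadNodeProbe {
>     public static void main(String[] args) {
>       String host = args[0];                 // the "failed" machine
>       int port = Integer.parseInt(args[1]);  // e.g. its datanode port
>       long start = System.currentTimeMillis();
>       try (Socket s = new Socket()) {
>         s.connect(new InetSocketAddress(host, port), 20000); // 20s connect timeout
>         s.setSoTimeout(60000);                               // 60s read timeout
>         s.getInputStream().read();                           // wait for anything
>       } catch (SocketTimeoutException e) {
>         // unplugged cable / null route: nothing answers, we wait out the timeout
>         System.out.println("timed out after "
>             + (System.currentTimeMillis() - start) + " ms");
>       } catch (IOException e) {
>         // kill -9: the remote OS is still alive and resets the connection at once
>         System.out.println("failed fast: " + e.getMessage());
>       }
>     }
>   }
>
> A kill -9 shows up immediately; a machine that is really gone only
> shows up when a timeout fires, which is exactly what makes the
> recovery slow.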
>
> That's all folks :-)
>
> N.
>
