hbase-issues mailing list archives

From "nkeywal (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HBASE-5843) Improve HBase MTTR - Mean Time To Recover
Date Fri, 07 Sep 2012 15:28:09 GMT

https://issues.apache.org/jira/browse/HBASE-5843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450718#comment-13450718

nkeywal commented on HBASE-5843:

Some tests and analysis around Distributed Split / datanode failures

On a real cluster, 3 nodes.
- dfs.replication = 2
- local HD. The test failed with the ramDrive. 
- Start with 2 DN and 2 RS. Create a table with 100 regions in the second one. The first holds
meta & root.
- Insert 1M or 10M rows, distributed over all regions. This creates 8 log files of ~60Mb each
(for the 10M case), on a single server.
- Start another box with a DN and a RS. This box is empty (no regions, no blocks).
- Unplug (physically) the box with the 100 regions and the 1 (for 1M puts) or 8 (for 10M puts)
log files.

Durations are in seconds, with HDFS 1.0.3 unless stated otherwise.

1M puts on 0.94:
~180s detection time, sometimes around 150s
~130s split time (there is a single file to split; compare this to the ~10s per split noted below).
~180s assignment, including replaying edits. There could be some locking, as we're reassigning/replaying
50 regions per server.
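Adding up these rough phase timings gives the end-to-end recovery time for this run; a trivial sketch (the numbers are just the approximate figures above, and the class name is made up):

```java
// Rough MTTR arithmetic for the 1M-puts / 0.94 run:
// recovery = detection + split + assignment (all figures approximate).
public class Mttr094 {
    static long totalSeconds(long detection, long split, long assignment) {
        return detection + split + assignment;
    }

    public static void main(String[] args) {
        // ~180s detection + ~130s split + ~180s assignment
        System.out.println("~" + totalSeconds(180, 130, 180) + "s end-to-end"); // ~490s end-to-end
    }
}
```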

1M puts on 0.96 (3 tests, one failure):
~180s detection time, sometimes around 150s
~180s split time. Once again a single file to split; it's unclear why it takes longer than on 0.94.
~180s assignment, as on 0.94.

Out of 3 tests, it failed once on 0.96. It didn't fail on 0.94.

10M puts on 0.96 + HDFS branch 2 as of today:
~180s detection time, sometimes around 150s
~11 minutes split. Basically it fails until the HDFS namenode marks the datanode as dead. That takes
7:30 minutes, so the split only finishes after this.
~60s assignment? Tested only once.

0M (zero) puts on 0.96 + HDFS branch 2 as of today:
~180s detection time, sometimes around 150s
~0s split. 
~3s assignment (this suggests the assignment time is mostly spent replaying edits).

10M puts on 0.96 + HDFS branch 2 + HDFS-3703 full (read + write paths):
~180s detection time, sometimes around 150s
~150s split. This is for a bad reason: all tasks except one succeed; the last one seems to
connect to the dead server and only finishes after ~100s. Tested twice.
~50s assignment. Measured once.
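As a sanity check on the "500s faster" figure attributed to HDFS-3703 further down: the split went from ~11 minutes (waiting for the namenode to declare the datanode dead) to ~150s. A quick computation (names are illustrative):

```java
// Savings from HDFS-3703 on the split phase, per the measurements above.
public class SplitSavings {
    static long savedSeconds(long withoutPatchMinutes, long withPatchSeconds) {
        return withoutPatchMinutes * 60 - withPatchSeconds;
    }

    public static void main(String[] args) {
        // ~11 minutes without the patch vs ~150s with it.
        System.out.println(savedSeconds(11, 150) + "s saved"); // 510s, i.e. roughly 500s
    }
}
```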

- The measurements on assignment are fishy, but they seem to say that we are now spending our
time replaying edits. We could have HDFS-related issues here as well: in the last two
scenarios we're not going to the dead node when we replay/flush edits, so that could be the
explanation.
- The split runs at ~10s per 60Gb, on a single and slow HD. With a reasonable cluster, this should
scale pretty well. We could improve things further by using locality.
- There will be datanode errors if you don't have HDFS-3703, and in that case it becomes
complicated. See HBASE-6738.
- With HDFS-3703, we're 500s faster. That's interesting.
- Even with HDFS-3703 there is still something to look at in how HDFS connects to the dead
node. It seems the block is empty, so retried multiple times. There are multiple possible
paths here.
- In production, from a server-side point of view, we can expect:
   - 30s detection time for a hardware failure, 0s for the simpler cases (kill -9, OOM, machine nicely
rebooted, ...)
   - 10s split (i.e: distributed along multiple region servers)
   - 10s assignment (i.e. distributed as well).
- This assumes no HDFS-side effects; see above.
- This scenario is extreme, as we're losing 50% of our data. Still, if you're losing a regionserver
with 300 regions, the split may not go so well if you're unlucky.
- It also means that the detection time dominates the other parameters when everything
goes well.
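With the production expectations above (30s detection, 10s split, 10s assignment), detection clearly dominates; a small sketch of that arithmetic (class name is made up):

```java
// Expected production MTTR, server side, per the estimates above.
public class ProdMttr {
    static long totalSeconds(long detection, long split, long assignment) {
        return detection + split + assignment;
    }

    public static void main(String[] args) {
        long total = totalSeconds(30, 10, 10);
        long detectionShare = 100 * 30 / total; // percentage of MTTR spent detecting
        System.out.println(total + "s total, " + detectionShare + "% of it detection"); // 50s total, 60% detection
    }
}
```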

- The HDFS / HBase link plays the critical role in this scenario. HDFS-3703 is one of the keys.
- The Distributed Split seems to work well in terms of performance.
- Assignment itself seems ok. Replaying should be looked at (more in terms of locking than raw
performance).
- Detection time will become more and more important.
- An improvement would be to reassign the regions in parallel with the split:
   - continue to serve writes before the end of the split: the fact that we're splitting
the logs does not mean we cannot write. There are real applications that could use this (maybe
OpenTSDB, for example: any application that just logs data only needs to know where
to write).
   - continue to serve reads if they are time-ranged with a max timestamp before the failure:
there are many applications that don't need fresh data (i.e. data less than 1 minute old).
- With this, the downtime will be totally dominated by the detection time.
- There are JIRAs around the detection time already (basically: improve ZK and open HBase
to external monitoring systems).
- There will be some work around the client part.
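The time-ranged read idea above can be sketched in plain Java. This is a hypothetical illustration, not HBase code: RecoveringRegion and lastSafeTimestamp are made-up names. The point is simply that a read whose time range ends before the last edit known to be durable cannot be affected by unreplayed log entries, so it could be served while the split is still running.

```java
// Hypothetical sketch: serving time-ranged reads during log splitting.
// A query that only asks for data older than the failure point cannot see
// any of the edits still sitting in the unreplayed logs, so it is safe to
// answer it before the split/replay finishes.
public class RecoveringRegion {
    // Everything with a timestamp <= this is already durable and visible.
    private final long lastSafeTimestamp;

    RecoveringRegion(long lastSafeTimestamp) {
        this.lastSafeTimestamp = lastSafeTimestamp;
    }

    /** True if a scan over [minTs, maxTs] can be served before replay finishes. */
    boolean canServeDuringRecovery(long minTs, long maxTs) {
        return maxTs <= lastSafeTimestamp;
    }

    public static void main(String[] args) {
        RecoveringRegion region = new RecoveringRegion(1_000_000L);
        System.out.println(region.canServeDuringRecovery(0, 999_999));   // entirely before the failure: can serve
        System.out.println(region.canServeDuringRecovery(0, 2_000_000)); // needs fresh data: must wait for replay
    }
}
```

On the real client side this would map to something like a scan with an explicit time range; the gating decision sketched above would sit on the server side.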
> Improve HBase MTTR - Mean Time To Recover
> -----------------------------------------
>                 Key: HBASE-5843
>                 URL: https://issues.apache.org/jira/browse/HBASE-5843
>             Project: HBase
>          Issue Type: Umbrella
>    Affects Versions: 0.96.0
>            Reporter: nkeywal
>            Assignee: nkeywal
> A part of the approach is described here: https://docs.google.com/document/d/1z03xRoZrIJmg7jsWuyKYl6zNournF_7ZHzdi0qz_B4c/edit
> The ideal target is:
> - failures impact client applications only by an added delay to execute a query, whatever
the failure.
> - this delay is always under 1 second.
> We're not going to achieve that immediately...
> Priority will be given to the most frequent issues.
> Short term:
> - software crash
> - standard administrative tasks as stop/start of a cluster.

