accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Sat, 22 Aug 2015 04:18:45 GMT


Josh Elser commented on ACCUMULO-3963:

bq. I'm just trying to think about a situation where HDFS is broken, and you want accumulo
to keep retrying. If there's a reasonable case for it, that's fine, and we can fix the retry
to do an exponential back-off or something.

The concrete case I ran into is actually the one you described, where Accumulo is started before
HDFS is entirely healthy (via Ambari). Ambari starts HDFS, then starts Accumulo. HDFS takes
its good old time coming out of safe-mode, and Accumulo ends up killing itself before
that happens. I am now curious as to why SetGoalState completed before safe-mode was exited,
but that may just be an Ambari issue.

The concrete case I'm afraid of is the namenode being on the bad side of a network partition.
Concretely, take a cluster spanning 10 racks and suppose the rack that holds the namenode flips
away for some reason (bad cable, bad switch, w/e). It temporarily "disappears". Given
active ingest and query, the datanodes that know where blocks are will likely continue to
operate since they don't need to re-talk to the NN (I believe this is the case). However,
every tablet server that needs to open a new file (an rfile for reading or a WAL for writing)
would try to talk to the NN, eventually get some sort of socket timeout, and then retry.
After 5 or 10 tries, each of these tservers would crash and require operator intervention.

In practice, operations may or may not notice that the NN had been partitioned away from
some of the nodes. The only visible outcome of this situation is that suddenly $n Accumulo
tservers are dead. Ideally, in this situation, someone notices that latencies on MR jobs,
queries, and ingest spiked for the duration of the partition, but needs to take no additional
steps on Accumulo. This is the 100% good "back-off requests, don't die" example.
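To make the "back-off requests, don't die" idea concrete, here's a minimal sketch of an unbounded retry loop with capped exponential backoff. This is purely illustrative -- the class and method names are hypothetical, not Accumulo's actual Retry API:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the "back off, don't die" policy; NOT Accumulo's
// actual Retry implementation, just an illustration of the idea.
public class BackoffSketch {

  // Exponential backoff: double the base wait each attempt, capped at capMs
  // so a long partition doesn't grow the sleep without bound.
  static long backoffMs(int attempt, long baseMs, long capMs) {
    long wait = baseMs * (1L << Math.min(attempt, 30)); // clamp shift to avoid overflow
    return Math.min(wait, capMs);
  }

  // Retry the operation indefinitely instead of crashing after N failures.
  // A real tserver would log each failure loudly so operators still notice.
  static <T> T retryForever(Callable<T> op, long baseMs, long capMs)
      throws InterruptedException {
    for (int attempt = 0;; attempt++) {
      try {
        return op.call();
      } catch (Exception e) {
        Thread.sleep(backoffMs(attempt, baseMs, capMs)); // degrade, don't die
      }
    }
  }
}
```

With something like this, a transient NN outage shows up as latency rather than dead tservers.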

The difficulty, as you already pointed out, is that we know from experience certain cases
in which operator intervention will likely be required anyway, so we should make that obvious. I
don't want to call this flawed, but I think it's a poor goal to work towards. Both
ZooKeeper and HDFS are meant to be resilient systems, and we should expect them to be
available. Hard drives flipping to r/o are much less of an issue since the local loggers went
out the window. Even so, there's no guarantee that the disks Accumulo is looking at consist
of all of the disks backing HDFS (Accumulo may only run on a subset of the HDFS
cluster). Ultimately, an HDD flipping to r/o is an HDFS reliability issue that should be addressed
at the HDFS level (assuming it's not the rootfs). We cannot adequately judge HDFS's health
from Accumulo other than via SafeMode, and we want to rely on HDFS itself to move away from a bad
node (although I think tricks like HBASE-11240 might help the case where DNs are slow).

I would much rather work on making Accumulo as bullet-proof as possible than assume that the
systems we depend on are less so (I'd be giddy if I ever felt Accumulo was more reliable
than ZK). I feel a huge service we can provide to users is removing the need to monitor failure
conditions. Degrade, but do not die. We're already pretty good at this -- I just think we
can do better.

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>                 Key: ACCUMULO-3963
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default "15
attempts over 10 seconds" a bit misleading, as it will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
Anything less isn't really going to insulate operators from transient HDFS failures (due to
services being restarted or network partitions).
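For scale, the arithmetic behind the quoted issue can be sketched like this. The 100 ms sleep and 15-attempt figures come from the report above; the 5-second cap is a hypothetical choice, not an actual Accumulo default:

```java
// Back-of-the-envelope sleep budgets for the retry settings discussed in the
// issue. Flat 100 ms x 15 attempts covers only 1.5 s; a doubling schedule
// with a (hypothetical) 5 s cap reaches the proposed 30-60 s window.
public class RetryBudget {

  // Total time spent sleeping with a flat wait between attempts.
  static long fixedTotalMs(int attempts, long waitMs) {
    return attempts * waitMs;
  }

  // Total time spent sleeping when the wait doubles each attempt, capped.
  static long expTotalMs(int attempts, long baseMs, long capMs) {
    long total = 0;
    long wait = baseMs;
    for (int i = 0; i < attempts; i++) {
      total += wait;
      wait = Math.min(wait * 2, capMs);
    }
    return total;
  }
}
```

With these (assumed) parameters, the same 15 attempts sleep for roughly 51 seconds in total instead of 1.5, which is the kind of window that would ride out a service restart or a brief partition.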

This message was sent by Atlassian JIRA
