Date: Sat, 22 Aug 2015 04:18:45 +0000 (UTC)
From: "Josh Elser (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS

    [ https://issues.apache.org/jira/browse/ACCUMULO-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707832#comment-14707832 ]

Josh Elser commented on ACCUMULO-3963:
--------------------------------------

bq. I'm just trying to think about a situation where HDFS is broken, and you want accumulo to keep retrying. If there's a reasonable case for it, that's fine, and we can fix the retry to do an exponential back-off or something.

The concrete case I ran into is actually the one you described, where Accumulo is started before HDFS is entirely healthy (via Ambari). Ambari starts HDFS, then starts Accumulo. HDFS takes its good old time coming out of safe mode, and Accumulo ends up killing itself before that happens. I am now curious as to why SetGoalState completed before safe mode was left, but that may just be an Ambari issue.

The concrete case I'm afraid of is the namenode being on the bad side of a network partition. Concretely, take a cluster spanning 10 racks and consider that the rack containing the namenode flips away for some reason (bad cable, bad switch, whatever). It temporarily "disappears". Under active ingest and query, the datanodes that already know where blocks are will likely continue to operate, as they don't need to re-talk to the NN (I believe this is the case). However, every tablet server that needs to open a new file (an rfile for reading or a WAL for writing) would try to talk to the NN, eventually get some sort of socket timeout, and then retry. After 5 or 10 tries, each of these tservers would crash and require operator intervention.

In practice, operations may or may not notice that the NN had been partitioned away from some of the nodes. The only visible outcome of this situation is that suddenly $n Accumulo tservers are dead. Ideally, someone would notice that latencies on MR jobs, queries, and ingest spiked for the duration of the partition, but would need to take no additional steps on Accumulo. This is the 100% good "back-off requests, don't die" example.
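To make the "back-off requests, don't die" idea concrete, here is a rough sketch of the kind of capped exponential back-off I have in mind. This is not the existing tserver retry code; the class name, method names, and numbers below are made up for illustration.

{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/**
 * Rough sketch of a capped exponential back-off for HDFS operations.
 * Not the real tserver retry code; names and numbers are illustrative only.
 */
public class HdfsBackoffSketch {

  /** A single HDFS operation we want to retry, e.g. opening a WAL. */
  public interface HdfsOp {
    void run() throws IOException;
  }

  private static final long INITIAL_WAIT_MS = 100;      // today's static sleep
  private static final long MAX_WAIT_MS = 10_000;       // cap any single sleep at 10s
  private static final long MAX_TOTAL_WAIT_MS = 60_000; // give up only after ~60s of waiting

  public static void runWithBackoff(HdfsOp op) throws IOException, InterruptedException {
    long waitMs = INITIAL_WAIT_MS;
    long totalWaitMs = 0;
    IOException lastFailure = null;

    while (totalWaitMs <= MAX_TOTAL_WAIT_MS) {
      try {
        op.run();
        return; // HDFS answered; nothing more to do
      } catch (IOException e) {
        lastFailure = e;
        // Back off instead of hammering the NN: 100ms -> 200ms -> ... up to the cap.
        TimeUnit.MILLISECONDS.sleep(waitMs);
        totalWaitMs += waitMs;
        waitMs = Math.min(waitMs * 2, MAX_WAIT_MS);
      }
    }
    // Only now give up and let the tserver decide whether to halt.
    throw lastFailure;
  }
}
{code}

With numbers like these, a transient blip (safe mode lasting a bit long, a rack flapping for a few seconds) shows up as latency rather than dead tservers, while a genuinely broken HDFS still ends in the same halt we have today, just on the order of a minute later.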
The difficulty, as you already pointed out, is that we know from experience certain cases in which operator intervention will likely be required anyway, so the failure should be made obvious. I don't want to call this flawed, but I think it's a poor goal to work towards. Both ZooKeeper and HDFS should be resilient systems that we can expect to be available. Hard drives flipping to read-only are much less of an issue since the local loggers went out the window. Even so, there's no guarantee that the disks Accumulo is looking at are all of the disks backing HDFS (Accumulo may only run on a subset of the HDFS cluster). Ultimately, an HDD flipping to read-only is an HDFS reliability issue that should be addressed at the HDFS level (assuming it's not the rootfs). We cannot adequately make a decision about HDFS's health from Accumulo, other than checking safe mode, and instead want to rely on HDFS to move away from that node (although I think tricks like HBASE-11240 might help the case where DNs are slow).

I would much rather work on making Accumulo as bullet-proof as possible than make the assumption that our dependent systems are less so (I'd be giddy if I ever felt Accumulo was more reliable than ZK). I feel a huge service we can provide to users is removing the need to monitor failure conditions. Degrade, but do not die. We're already pretty good at this -- I just think we can do better.


> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>
>                 Key: ACCUMULO-3963
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3963
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
>
>
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default "15 attempts over 10 seconds" a bit misleading, as the tserver will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box. Anything less isn't really going to insulate operators from transient HDFS failures (due to services being restarted or network partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)