Date: Sat, 22 Aug 2015 04:18:45 +0000 (UTC)
From: "Josh Elser (JIRA)"
To: notifications@accumulo.apache.org
Reply-To: jira@apache.org
Subject: [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS

    [ https://issues.apache.org/jira/browse/ACCUMULO-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707832#comment-14707832 ]

Josh Elser commented on ACCUMULO-3963:
--------------------------------------

bq. I'm just trying to think about a situation where HDFS is broken, and you want accumulo to keep retrying. If there's a reasonable case for it, that's fine, and we can fix the retry to do an exponential back-off or something.

The concrete case I ran into is actually the one you described, where Accumulo is started before HDFS is entirely healthy (via Ambari). Ambari starts HDFS, then starts Accumulo. HDFS takes its good old time coming out of safe mode, and Accumulo ends up killing itself before that happens. I am now curious as to why SetGoalState completed before safe mode was left, but that may just be an Ambari issue.

The concrete case I'm afraid of is the namenode being on the bad side of a network partition. Concretely, take a cluster spanning 10 racks and consider that the rack containing the namenode flips away for some reason (bad cable, bad switch, whatever). It temporarily "disappears". Under active ingest and query, the datanodes that already know where blocks are will likely continue to operate, as they don't need to re-talk to the NN (I believe this is the case). However, every tablet server that needs to open a new file (an rfile for reading or a WAL for writing) would try to talk to the NN, eventually get some sort of socket timeout, and then retry. After 5 or 10 tries, each of these tservers would crash and require operator intervention.

In practice, operations may or may not notice that the NN had been partitioned away from some of the nodes. The only visible outcome of this situation is that suddenly $n Accumulo tservers are dead. Ideally, someone would notice that latencies on MR jobs, queries, and ingest spiked for the duration of the partition, but would need to take no additional steps on Accumulo. This is the 100% good "back-off requests, don't die" example.
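To make the "back-off requests, don't die" idea concrete, here is a rough sketch of the kind of capped exponential back-off I have in mind. This is not the existing tserver retry code; the class name, method names, and numbers below are made up for illustration.

{code:java}
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/**
 * Rough sketch of a capped exponential back-off for HDFS operations.
 * Not the real tserver retry code; names and numbers are illustrative only.
 */
public class HdfsBackoffSketch {

  /** A single HDFS operation we want to retry, e.g. opening a WAL. */
  public interface HdfsOp {
    void run() throws IOException;
  }

  private static final long INITIAL_WAIT_MS = 100;      // today's static sleep
  private static final long MAX_WAIT_MS = 10_000;       // cap any single sleep at 10s
  private static final long MAX_TOTAL_WAIT_MS = 60_000; // give up only after ~60s of waiting

  public static void runWithBackoff(HdfsOp op) throws IOException, InterruptedException {
    long waitMs = INITIAL_WAIT_MS;
    long totalWaitMs = 0;
    IOException lastFailure = null;

    while (totalWaitMs <= MAX_TOTAL_WAIT_MS) {
      try {
        op.run();
        return; // HDFS answered; nothing more to do
      } catch (IOException e) {
        lastFailure = e;
        // Back off instead of hammering the NN: 100ms -> 200ms -> ... up to the cap.
        TimeUnit.MILLISECONDS.sleep(waitMs);
        totalWaitMs += waitMs;
        waitMs = Math.min(waitMs * 2, MAX_WAIT_MS);
      }
    }
    // Only now give up and let the tserver decide whether to halt.
    throw lastFailure;
  }
}
{code}

With numbers like these, a transient blip (safe mode lasting a bit long, a rack flapping for a few seconds) shows up as latency rather than dead tservers, while a genuinely broken HDFS still ends in the same halt we have today, just on the order of a minute later.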
The difficulty, as you already pointed out, is that we know from experience certain cases in which operator intervention will likely be required anyway, so the failure should be made obvious. I don't want to call this flawed, but I think it's a poor goal to work towards. Both ZooKeeper and HDFS should be resilient systems that we can expect to be available. Hard drives flipping to read-only are much less of an issue since the local loggers went out the window. Even so, there's no guarantee that the disks Accumulo is looking at are all of the disks backing HDFS (Accumulo may only run on a subset of the HDFS cluster). Ultimately, an HDD flipping to read-only is an HDFS reliability issue that should be addressed at the HDFS level (assuming it's not the rootfs). We cannot adequately make a decision about HDFS's health from Accumulo, other than checking safe mode, and instead want to rely on HDFS to move away from that node (although I think tricks like HBASE-11240 might help the case where DNs are slow).

I would much rather work on making Accumulo as bullet-proof as possible than make the assumption that our dependent systems are less so (I'd be giddy if I ever felt Accumulo was more reliable than ZK). I feel a huge service we can provide to users is removing the need to monitor failure conditions. Degrade, but do not die. We're already pretty good at this -- I just think we can do better.


> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>
>                 Key: ACCUMULO-3963
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3963
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
>
>
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default "15 attempts over 10 seconds" a bit misleading, as the tserver will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box. Anything less isn't really going to insulate operators from transient HDFS failures (due to services being restarted or network partitions).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)