accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Fri, 14 Aug 2015 20:30:45 GMT


Josh Elser commented on ACCUMULO-3963:

Thinking about this some more, I'm reminded of a thread
I read earlier this week:

I've never seen a distributed system crash altogether when a network dependency goes away;
the usual practice I've seen is to sleep and try to reconnect every few seconds.

Accumulo killing itself when there are transient failures in dependent systems (HDFS/ZK) 
is an operator's headache. It does nothing other than force operators to restart Accumulo
or implement a system to automatically restart it. Rereading ACCUMULO-2480, the original complaint
was that the TabletServer didn't exit:

because the filesystem was closed, attempts to get a new WAL continued to fail

We have essentially taken what was normally a service that would have naturally recovered
and gimped it. Seems like a step back to me. What do others think?

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>                 Key: ACCUMULO-3963
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
> of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default 15
> attempts over 10 seconds a bit misleading as it will kill itself after 1.5 seconds not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
> Anything less isn't really going to insulate operators from transient HDFS failures (due to
> services being restarted or network partitions).
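
To make the arithmetic in the description concrete, here is a rough Java sketch comparing the static 100ms sleep with one possible incremental backoff. This is not the Accumulo implementation: the class and method names are hypothetical, and the doubling policy and the 5000ms cap are assumptions chosen only so that 15 attempts land in the 30-60s range mentioned above.

```java
// Hypothetical sketch, not Accumulo code: compares a static sleep
// with a capped, doubling backoff over a fixed number of attempts.
final class BackoffSketch {

    /** Total time slept when every failure is followed by the same fixed sleep. */
    static long totalStaticSleepMillis(int attempts, long sleepMillis) {
        return attempts * sleepMillis;
    }

    /**
     * Sleep to use after the given failure (1-based). Starts at initialMillis,
     * doubles after each further failure, and never exceeds maxMillis.
     * The doubling policy and the cap are assumptions made for illustration.
     */
    static long backoffMillis(int failures, long initialMillis, long maxMillis) {
        long sleep = initialMillis;
        // Stop doubling once the cap is reached so the value cannot overflow.
        for (int i = 1; i < failures && sleep < maxMillis; i++) {
            sleep *= 2;
        }
        return Math.min(sleep, maxMillis);
    }

    /** Total time slept across all attempts when using the capped backoff. */
    static long totalBackoffMillis(int attempts, long initialMillis, long maxMillis) {
        long total = 0;
        for (int failure = 1; failure <= attempts; failure++) {
            total += backoffMillis(failure, initialMillis, maxMillis);
        }
        return total;
    }

    public static void main(String[] args) {
        // 15 attempts * 100ms = 1500ms, the 1.5 seconds noted in the description.
        System.out.println("static:  " + totalStaticSleepMillis(15, 100) + " ms");
        // 100+200+400+800+1600+3200 = 6300ms, then 9 sleeps capped at 5000ms = 51300ms.
        System.out.println("backoff: " + totalBackoffMillis(15, 100, 5000) + " ms");
    }
}
```

With these assumed values the same 15 attempts span roughly 51 seconds instead of 1.5, which is the kind of window that could ride out a service restart. A real change would presumably read the initial sleep and the cap from configuration rather than hard-coding them.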

This message was sent by Atlassian JIRA
