accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Sat, 31 Oct 2015 18:34:27 GMT


Josh Elser commented on ACCUMULO-3963:

[~kturner] helped me out last night by reviewing some changes I was thinking about. Summarizing
what we found:

* The current implementation using a Cache is potentially inaccurate, as {{size()}} is not
guaranteed to be accurate in the presence of time-based eviction.
* 1.7.0 had the hard-coded tserver killing; the config properties currently in place were
added after 1.7.0 was released (so there is no concern about removing them).
* It would be nice if we could encapsulate a standard set of retry-related values into a single
configuration property. For example, the knobs we presently have for the Retry class (used by our
ZK code) are iterations, initial wait, wait increment, and maximum wait. Concretely: retry
25 times, wait 1s the first time, 500ms more each subsequent time, but never wait longer than 5s.
Instead of exposing multiple properties for these values, we could encapsulate them in one
property, which would reduce the configuration burden.
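A minimal sketch of what that single property might look like. The property format, class, and method names below are my own invention for illustration, not anything Accumulo actually defines; the four knobs and the 25/1s/500ms/5s numbers come from the example above, and the maximum wait is assumed to cap each individual sleep.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical: all four retry knobs packed into one comma-separated property value.
public class RetryConfig {
  final long maxRetries, initialWaitMs, waitIncrementMs, maxWaitMs;

  RetryConfig(long maxRetries, long initialWaitMs, long waitIncrementMs, long maxWaitMs) {
    this.maxRetries = maxRetries;
    this.initialWaitMs = initialWaitMs;
    this.waitIncrementMs = waitIncrementMs;
    this.maxWaitMs = maxWaitMs;
  }

  // Parse e.g. "retries=25,initialWait=1000,waitIncrement=500,maxWait=5000" (values in ms)
  static RetryConfig parse(String value) {
    Map<String, Long> kv = new HashMap<>();
    for (String pair : value.split(",")) {
      String[] parts = pair.split("=", 2);
      kv.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
    }
    return new RetryConfig(kv.get("retries"), kv.get("initialWait"),
        kv.get("waitIncrement"), kv.get("maxWait"));
  }

  // Wait before attempt i (0-based): initial wait plus i increments, capped at maxWait
  long waitForAttempt(int i) {
    return Math.min(initialWaitMs + (long) i * waitIncrementMs, maxWaitMs);
  }

  public static void main(String[] args) {
    RetryConfig rc = parse("retries=25,initialWait=1000,waitIncrement=500,maxWait=5000");
    System.out.println(rc.waitForAttempt(0));   // 1000
    System.out.println(rc.waitForAttempt(1));   // 1500
    System.out.println(rc.waitForAttempt(20));  // 5000 (capped)
  }
}
```

One property value like this keeps the four related knobs together, so an operator cannot, say, set the increment without also seeing the cap it interacts with.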

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>                 Key: ACCUMULO-3963
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 1.7.1, 1.8.0
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default 15
attempts over 10 seconds a bit misleading, as the tserver will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
Anything less isn't really going to insulate operators from transient HDFS failures (due to
services being restarted or network partitions).
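The arithmetic in the quoted issue can be checked directly. The sketch below contrasts the static 100ms sleep (15 × 100ms = 1.5s) with an incremental backoff; the 1s/500ms/5s numbers are only illustrative (borrowed from the Retry example earlier in this comment), not Accumulo's actual defaults.

```java
// Sums the sleeps across all retry attempts for a linear backoff with a per-sleep cap.
public class BackoffMath {
  static long totalWaitMs(int attempts, long initialMs, long incrementMs, long capMs) {
    long total = 0;
    for (int i = 0; i < attempts; i++) {
      // Each attempt waits the initial time plus i increments, capped per sleep
      total += Math.min(initialMs + (long) i * incrementMs, capMs);
    }
    return total;
  }

  public static void main(String[] args) {
    // Static 100ms sleep: 15 attempts give up after only 1.5s, not the advertised 10s
    System.out.println(totalWaitMs(15, 100, 0, 100));    // 1500
    // Incremental: 1s start, +500ms per failure, 5s cap -> 57s total,
    // which lands in the 30-60s window suggested above
    System.out.println(totalWaitMs(15, 1000, 500, 5000)); // 57000
  }
}
```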

This message was sent by Atlassian JIRA
