accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Fri, 21 Aug 2015 19:00:46 GMT


Josh Elser commented on ACCUMULO-3963:

bq. Losing a key component, that already has some redundancy built in (HDFS, zookeeper), results
in a broken system

"Losing" is not well defined for the purpose of this argument. Network partitions are the
most common example.

bq. Operator headache, sure, but fail-fast is often a better solution than retry-forever.

Want to expand on why you feel this is the case? This is the fundamental point I'm arguing
against and feel is harmful.

Repeating, the point is that if a dependent component has an intermittent failure (or even
if it does require operator intervention), Accumulo itself shouldn't also require operator
intervention.

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>                 Key: ACCUMULO-3963
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default of 15
attempts over 10 seconds a bit misleading, as the tserver will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
Anything less isn't really going to insulate operators from transient HDFS failures (due to
services being restarted or network partitions).
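The incremental backoff described above could look roughly like the sketch below. This is an illustrative doubling-with-cap scheme, not Accumulo's actual implementation; the class name, constructor parameters, and method names are all hypothetical.

```java
/**
 * Sketch of incremental (exponential) backoff between HDFS write
 * retries, replacing a fixed 100ms sleep. All names here are
 * illustrative, not Accumulo's real configuration keys or classes.
 */
public class RetryBackoff {
    private final long initialWaitMs;
    private final long maxWaitMs;
    private long currentWaitMs;

    public RetryBackoff(long initialWaitMs, long maxWaitMs) {
        this.initialWaitMs = initialWaitMs;
        this.maxWaitMs = maxWaitMs;
        this.currentWaitMs = initialWaitMs;
    }

    /** Returns the wait before the next retry, doubling up to the cap. */
    public long nextWaitMs() {
        long wait = currentWaitMs;
        currentWaitMs = Math.min(currentWaitMs * 2, maxWaitMs);
        return wait;
    }

    /** Called after a successful write so later failures start small again. */
    public void reset() {
        currentWaitMs = initialWaitMs;
    }
}
```

With a 100ms initial wait capped at 10s, the cumulative wait across 15 attempts grows to well over a minute rather than the 1.5 seconds the static sleep gives, which is in line with the 30-60s window suggested above.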

This message was sent by Atlassian JIRA
