accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Fri, 21 Aug 2015 20:40:45 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14707422#comment-14707422 ]

Eric Newton commented on ACCUMULO-3963:
---------------------------------------

First, I don't have terribly strong feelings about fail-fast. I can be easily convinced otherwise.

Consider the use case where a filesystem goes read-only on one node. Of course we monitor for
this now, but before we did, it was really annoying. The writes would fail, then retry on
other nodes, causing long write delays. I would have preferred fail-fast, even if it wiped
out the whole node.

However, there's a retry case already built into Accumulo: it will wait for HDFS and
ZooKeeper to come up before starting. We did this so that services could be started in any
order and everything would just work.

I'm just trying to think of a situation where HDFS is broken and you still want Accumulo to
keep retrying. If there's a reasonable case for it, that's fine, and we can fix the retry to
do an exponential back-off or something.
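
To make that concrete, here's a rough sketch of the kind of back-off loop I mean (the class
name, constants, and limits are made up for illustration; this is not the actual tserver
retry code):

    // Illustrative sketch only: names and limits are hypothetical,
    // not Accumulo's real retry implementation.
    public class BackoffRetry {
        private static final int MAX_ATTEMPTS = 15;
        private static final long INITIAL_WAIT_MS = 100;
        private static final long MAX_WAIT_MS = 10_000;

        public static void runWithBackoff(Runnable write) throws InterruptedException {
            long waitMs = INITIAL_WAIT_MS;
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    write.run();
                    return; // write succeeded
                } catch (RuntimeException e) {
                    if (attempt == MAX_ATTEMPTS) {
                        throw e; // out of attempts: fail fast
                    }
                    Thread.sleep(waitMs);
                    waitMs = Math.min(waitMs * 2, MAX_WAIT_MS); // double the wait, capped
                }
            }
        }
    }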

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>
>                 Key: ACCUMULO-3963
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3963
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
>
>
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
> of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default 15
> attempts over 10 seconds a bit misleading, as it will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
> Anything less isn't really going to insulate operators from transient HDFS failures (due to
> services being restarted or network partitions).
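
A quick worked example of the timing (assuming a doubling back-off that starts at 100ms and
caps at 10s per sleep; both numbers are assumptions for illustration, not settings from the
ticket): the sleeps run 0.1s, 0.2s, 0.4s, 0.8s, 1.6s, 3.2s, 6.4s, then 10s each, so the
cumulative wait crosses 30s before the 10th attempt and 60s before the 13th, roughly the
30-60s window proposed above. A flat 100ms sleep, by contrast, exhausts all 15 attempts in
1.5 seconds.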



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
