accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3963) Incremental backoff on inability to write to HDFS
Date Fri, 14 Aug 2015 22:39:46 GMT


Josh Elser commented on ACCUMULO-3963:

I think you hit on the confusion. I believe ACCUMULO-2480 was specifically trying to work
around the case where the TServer got into a state from which it couldn't recover (the underlying
cached FileSystem got closed). The "fix" for this was having the server kill itself _any time
it couldn't talk to HDFS for a $period of time_. The fix doesn't match the problem.

If the tserver gets into a state from which it can't recover, then yes, let's make sure it exits.
However, that's not the fix that was implemented. Do you agree?

> Incremental backoff on inability to write to HDFS
> -------------------------------------------------
>                 Key: ACCUMULO-3963
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.7.0
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.1, 1.8.0
> ACCUMULO-2480 added some support to kill the tserver if HDFS is unavailable after a number
> of checks. ACCUMULO-3937 added some configuration values to loosen this.
> We still only sleep for a static 100ms after every failure. This makes the default of 15
> attempts over 10 seconds a bit misleading, as the tserver will kill itself after 1.5 seconds, not 10.
> I'm thinking that this should really be more like a 30-60s wait period out of the box.
> Anything less isn't really going to insulate operators from transient HDFS failures (due to
> services being restarted or network partitions).
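
To illustrate the arithmetic in the quoted issue: a minimal, hypothetical sketch of incremental (doubling) backoff with a per-attempt cap, as opposed to the static 100ms sleep. The class name `BackoffPolicy` and the constants are illustrative assumptions, not Accumulo's actual API or configuration values.

```java
// Hypothetical sketch: exponential backoff with a cap for HDFS
// availability retries. Not Accumulo's implementation.
public class BackoffPolicy {
    static final long INITIAL_MS = 100;   // first retry delay (matches the static sleep)
    static final long MAX_MS = 10_000;    // cap on any single delay

    // Delay before the given 1-based attempt: doubles each attempt, capped.
    static long delayMs(int attempt) {
        long d = INITIAL_MS << (attempt - 1);
        // Guard against shift overflow for large attempt counts.
        return (d <= 0 || d > MAX_MS) ? MAX_MS : d;
    }

    // Total time spent sleeping across n failed attempts.
    static long totalMs(int attempts) {
        long total = 0;
        for (int i = 1; i <= attempts; i++) {
            total += delayMs(i);
        }
        return total;
    }

    public static void main(String[] args) {
        // With the static sleep, 15 attempts wait only 15 * 100ms = 1.5s.
        // With doubling capped at 10s, the same 15 attempts span 92.7s,
        // comfortably inside the 30-60s+ window the issue suggests.
        System.out.println("static:  " + (15 * 100) + " ms");
        System.out.println("backoff: " + totalMs(15) + " ms");
    }
}
```

The same 15-attempt default then rides out restarts and short partitions instead of giving up in under two seconds.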

This message was sent by Atlassian JIRA
