accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Resolved] (ACCUMULO-2480) ha fail-failover failure
Date Mon, 29 Sep 2014 13:38:33 GMT


Eric Newton resolved ACCUMULO-2480.
    Resolution: Fixed

> ha fail-failover failure
> ------------------------
>                 Key: ACCUMULO-2480
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: master, tserver
>         Environment: running continuous ingest on a 74-node HA NN hadoop 2.3 cluster,
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>             Fix For: 1.7.0
>          Time Spent: 10m
>  Remaining Estimate: 0h
> Ran {{service network stop}} on the active NN.  The service failed to switch over to the standby because the fencing script on the standby (sshfence) failed to run.
> After the network interface was re-established, the standby took over.
> However, Accumulo ingest began to see very long hold times because the standby did not provide service for several minutes.
> The master attempted to shut down the tablet servers with hold time.
> The filesystem hook closed the filesystem, and the servers got stuck endlessly trying to write to the WAL.
> Even after the NN was active, attempts to get a new WAL continued to fail because the filesystem was closed.
> * Why didn't the tablet servers stop?
> * The WAL loop should be able to terminate if it sees an IOException that indicates that the filesystem is closed.
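
The second point above can be sketched roughly as follows. This is only an illustration, not the change committed for this issue: the class WalRetrySketch, the WalOpener interface, and both method names are hypothetical, and the check assumes that a closed filesystem surfaces as an IOException whose message contains "Filesystem closed" somewhere in the cause chain. Transient failures are retried; the closed-filesystem failure is rethrown immediately so the loop ends.

```java
import java.io.IOException;

// Hypothetical sketch: none of these names come from Accumulo or Hadoop.
class WalRetrySketch {

    /** Hypothetical stand-in for whatever opens a new write-ahead log. */
    interface WalOpener {
        void open() throws IOException;
    }

    /**
     * Assumption: a closed filesystem shows up as an IOException whose
     * message contains "Filesystem closed", possibly wrapped in another
     * exception. Walks the cause chain looking for it.
     */
    static boolean isFilesystemClosed(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            String msg = c.getMessage();
            if (c instanceof IOException && msg != null
                    && msg.contains("Filesystem closed")) {
                return true;
            }
        }
        return false;
    }

    /**
     * Retries opener.open() after transient IOExceptions, sleeping between
     * attempts, but rethrows at once when the filesystem is closed, since
     * retrying against a closed filesystem cannot succeed.
     * Returns the number of attempts that were needed.
     */
    static int openWithRetry(WalOpener opener, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                opener.open();
                return attempt;
            } catch (IOException e) {
                if (isFilesystemClosed(e)) {
                    throw e; // fatal: leave the loop instead of spinning
                }
                last = e;
                Thread.sleep(sleepMillis);
            }
        }
        throw new IOException("gave up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) throws Exception {
        try {
            openWithRetry(() -> { throw new IOException("Filesystem closed"); }, 5, 10L);
        } catch (IOException e) {
            System.out.println("stopped retrying: " + e.getMessage());
        }
    }
}
```

Matching on the message text is brittle; where the filesystem client exposes a dedicated exception type or a way to ask whether it is still open, that would be the better test.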

This message was sent by Atlassian JIRA
