accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-2480) ha fail-failover failure
Date Fri, 14 Mar 2014 20:08:43 GMT
Eric Newton created ACCUMULO-2480:

             Summary: ha fail-failover failure
                 Key: ACCUMULO-2480
             Project: Accumulo
          Issue Type: Bug
          Components: master, tserver
         Environment: running continuous ingest on a 74-node HA NN hadoop 2.3 cluster, 1.6.0-SNAPSHOT.
            Reporter: Eric Newton

Ran {{service network stop}} on the active NN.  Failover did not happen because the
fencing script on the standby (sshfence) failed to run.

After the network interface was re-established, the standby took over.

However, Accumulo ingest began to see very long hold times because the standby had not
provided service for several minutes.

The master attempted to shut down the tablet servers that were reporting hold time.

The filesystem hook closed the filesystem, and the servers got stuck endlessly trying to write
to the WAL.

Even after the NN was active, because the filesystem was closed, attempts to get a new WAL
continued to fail.

* Why didn't the tablet servers stop?
* The WAL loop should be able to terminate if it sees an IOException indicating that the
filesystem is closed.
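
A minimal sketch of the second point, not the actual tserver code. It assumes the closed
filesystem surfaces as an IOException whose message contains "Filesystem closed"; that string
match, the {{WalOpener}} interface, and the {{openWithRetry}} helper are all hypothetical names
used for illustration. The loop here is also bounded by a maximum attempt count, which is an
added assumption; the point is only that a closed filesystem is treated as fatal rather than
retried.

```java
import java.io.IOException;

final class WalRetrySketch {

    /** Hypothetical stand-in for whatever call creates a new WAL. */
    @FunctionalInterface
    interface WalOpener {
        String open() throws IOException;
    }

    /**
     * Assumption: a closed filesystem is reported as an IOException whose
     * message contains "Filesystem closed". Real code may need a different check.
     */
    static boolean isFilesystemClosed(IOException e) {
        String msg = e.getMessage();
        return msg != null && msg.contains("Filesystem closed");
    }

    /**
     * Retries transient failures, but gives up immediately when the
     * filesystem is closed, since no retry can succeed in that state.
     */
    static String openWithRetry(WalOpener opener, int maxAttempts, long sleepMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return opener.open();
            } catch (IOException e) {
                if (isFilesystemClosed(e)) {
                    throw e; // unrecoverable: let the caller halt the server
                }
                last = e; // possibly transient (e.g. NN failover in progress)
                Thread.sleep(sleepMillis);
            }
        }
        throw last;
    }
}
```

With something like this, the caller can treat the rethrown exception as a signal to stop the
tablet server instead of looping on a filesystem handle that will never work again.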

This message was sent by Atlassian JIRA