accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-2053) Slow reassignment after failure and recovery
Date Wed, 18 Dec 2013 05:21:06 GMT
Josh Elser created ACCUMULO-2053:

             Summary: Slow reassignment after failure and recovery
                 Key: ACCUMULO-2053
             Project: Accumulo
          Issue Type: Improvement
          Components: master
         Environment: 5bb28edb with Hadoop 2.2.0
            Reporter: Josh Elser

While running CI, I noticed the following situation. Agitation killed a tabletserver. Recovery
was performed, but according to the monitor the tablets had not yet been reassigned. After a
minute, a significant number of tablets (~15 out of 150) were still offline for a single table.
The tablets went from unassigned to assigned one at a time.

Tailing the master log confirmed this. The following lines were repeated for every offline
tablet:

2013-12-17 21:10:52,615 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/0a60966c-b72d-4643-bf39-3fbfec342cc0
to hdfs://namenode/accumulo/recovery/0a60966c-b72d-4643-bf39-3fbfec342cc0
2013-12-17 21:10:52,624 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/327e38cb-9f96-41a4-baff-a97d89d523e9
to hdfs://nameservice/accumulo/recovery/327e38cb-9f96-41a4-baff-a97d89d523e9
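To check how often each write-ahead log shows up in such output, the master log could be scanned with a small script. The following is only a sketch: the regular expression is inferred from the two sample entries above, and the names ENTRY, recovery_events, and SAMPLE are made up for illustration, not part of Accumulo.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern inferred from the sample entries above (an assumption, not an
# official log format). \s+ before "to" tolerates the wrapped line.
ENTRY = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[recovery\.RecoveryManager\] DEBUG: Recovering (?P<src>\S+)\s+to (?P<dst>\S+)"
)


def recovery_events(log_text):
    """Return (timestamp, source WAL, recovery destination) for each entry."""
    events = []
    for match in ENTRY.finditer(log_text):
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        events.append((ts, match.group("src"), match.group("dst")))
    return events


# The two entries quoted in this message, copied verbatim.
SAMPLE = """\
2013-12-17 21:10:52,615 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/0a60966c-b72d-4643-bf39-3fbfec342cc0
to hdfs://namenode/accumulo/recovery/0a60966c-b72d-4643-bf39-3fbfec342cc0
2013-12-17 21:10:52,624 [recovery.RecoveryManager] DEBUG: Recovering hdfs://nameservice/accumulo/wal/tserver1+9997/327e38cb-9f96-41a4-baff-a97d89d523e9
to hdfs://nameservice/accumulo/recovery/327e38cb-9f96-41a4-baff-a97d89d523e9
"""

events = recovery_events(SAMPLE)
# Count how many times recovery of each source WAL was logged.
per_wal = Counter(src for _, src, _ in events)
for src, count in per_wal.items():
    print(count, src)
```

Run against the full master log, a count well above one for the same WAL would indicate that the same recovery message is being logged again for each offline tablet.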

It seems like we should be able to bring all of these tablets back online at once (or at least
more than one every 10 seconds, as the log showed), because the recovery file had already been
created. This made the complete recovery process take longer than it should have, as we waited
150s before the last tablet was reassigned.
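The 150s figure follows from roughly 15 offline tablets being assigned one per 10-second interval. A back-of-the-envelope sketch of that arithmetic is below; the function name and the tablets_per_pass parameter are hypothetical and only model the observation, not how the master actually schedules assignments.

```python
import math


def total_reassignment_seconds(offline_tablets, seconds_per_pass, tablets_per_pass=1):
    """Seconds until the last tablet is assigned, assuming each pass of a
    (hypothetical) assignment loop brings `tablets_per_pass` tablets online."""
    if tablets_per_pass < 1:
        raise ValueError("tablets_per_pass must be at least 1")
    passes = math.ceil(offline_tablets / tablets_per_pass)
    return passes * seconds_per_pass


# Observed behaviour: ~15 tablets, one assigned every 10 seconds.
serial = total_reassignment_seconds(15, 10)
# If all tablets could be assigned in a single pass instead.
batched = total_reassignment_seconds(15, 10, tablets_per_pass=15)
print(serial, batched)
```

Under these assumptions the serial case takes 150 seconds and the single-pass case takes 10, which is the gap this issue is about.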

This message was sent by Atlassian JIRA
