accumulo-notifications mailing list archives

From "Eric Newton (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-942) accumulo should be more resilient in the face of NN failures
Date Tue, 08 Jan 2013 02:10:12 GMT
Eric Newton created ACCUMULO-942:
------------------------------------

             Summary: accumulo should be more resilient in the face of NN failures
                 Key: ACCUMULO-942
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-942
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
            Reporter: Eric Newton
            Assignee: Keith Turner
            Priority: Critical


We experienced a NN failure on a large cluster.  The edit log was written to a RAIDed file
system, but data sent to the edit log was still lost.  We suspect the drivers made promises
they did not keep.

This left Accumulo in a slightly corrupt state: a few references to files that were missing.
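
As a rough illustration of what "references to files that were missing" means, the check can
be pictured as a set difference between the file paths the metadata references and the paths
the file system actually holds.  The sketch below is only that: it works on plain in-memory
sets rather than the real !METADATA table or HDFS, and the class and method names are
hypothetical, not part of Accumulo.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not Accumulo code.
final class DanglingFileRefs {
    // Returns the referenced paths that do not appear in the file system listing.
    static Set<String> findMissing(Set<String> referenced, Set<String> present) {
        Set<String> missing = new TreeSet<>(referenced); // sorted copy; inputs are not modified
        missing.removeAll(present);
        return missing;
    }
}
```

In a real check, the first set would presumably come from a scan of the metadata and the
second from a listing of the table directories in HDFS.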

Also, we have tried archiving backup images of HDFS for disaster recovery.  This has not
helped, because Accumulo needs a highly consistent set of metadata, and a slightly
older version of the file system confuses it.

One defense is to use snapshots.  However, this works at the table level, and it is hard to
coordinate with the HDFS snapshot.

Another approach is to leave a short history of the files in the !METADATA table.  The Google
paper hints at keeping historical information:

{quote}
We also store secondary information in the
METADATA table, including a log of all events
pertaining to each tablet (such as when a server begins
serving it). This information is helpful for debugging
and performance analysis.
{quote}

I think it would also be helpful for disaster recovery.  It may require the GC to be more
sensitive to historical information about compactions.
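
One way to picture the interaction with the GC is a small model in which each tablet keeps
its current file set plus a bounded list of earlier file sets, and a file is only eligible
for deletion once nothing in that window references it.  This is a sketch under assumptions:
the class, its methods, and the maxHistory parameter are hypothetical and do not describe
how Accumulo stores or collects files today.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of one tablet's file history; not Accumulo code.
final class TabletFileHistory {
    private final int maxHistory; // number of earlier file sets to retain (assumed parameter)
    private Set<String> current = new HashSet<>();
    private final Deque<Set<String>> history = new ArrayDeque<>();

    TabletFileHistory(int maxHistory) {
        this.maxHistory = maxHistory;
    }

    // Record a change such as a compaction: the old file set moves into the bounded history.
    void replaceFiles(Set<String> newFiles) {
        history.addFirst(current);
        while (history.size() > maxHistory) {
            history.removeLast(); // forget the oldest entry
        }
        current = new HashSet<>(newFiles);
    }

    // A file is deletable only if neither the current set nor any retained entry references it.
    boolean safeToDelete(String file) {
        if (current.contains(file)) {
            return false;
        }
        for (Set<String> old : history) {
            if (old.contains(file)) {
                return false;
            }
        }
        return true;
    }
}
```

With maxHistory set to 1, the inputs of the most recent compaction stay protected until the
next change to the tablet, which is the kind of window an older file system image would need
in order to still find its files.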

Alternatively, we should start looking into high-availability NNs and BookKeeper for
high-performance logging.



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
