accumulo-dev mailing list archives

From Josh Elser <josh.el...@gmail.com>
Subject Re: [ADVISORY] Possible data loss during HDFS decommissioning
Date Wed, 23 Sep 2015 17:06:18 GMT
-cc user@ (figure I'm forking this into a more dev-focused question now)

True, we don't have procedures for retroactively changing docs. I guess 
JIRA essentially acts as the discovery mechanism for which versions an 
issue affects. People generally seem to understand searching JIRA to 
find known issues, too.

My only worry about creating a page on the website is that it's yet 
another place people have to search to get the details on some 
operational subject. We've been doing well (since 1.6) at capturing 
details like this in the user manual, so I figured this would also make 
sense to mention there. Perhaps multiple places is reasonable too?

dlmarion@comcast.net wrote:
> Known issues in the release notes on the web page? We would have to
> update every version, though. It seems like we need a known-issues
> document that lists issues in dependencies that transcend Accumulo versions.
>
> ------------------------------------------------------------------------
> *From: *"Josh Elser" <josh.elser@gmail.com>
> *To: *dev@accumulo.apache.org
> *Cc: *user@accumulo.apache.org
> *Sent: *Wednesday, September 23, 2015 10:26:50 AM
> *Subject: *Re: [ADVISORY] Possible data loss during HDFS decommissioning
>
> What kind of documentation can we put in the user manual about this?
> Recommend decommissioning only one rack at a time until we get the
> issue sorted out in Hadoop-land?
>
> dlmarion@comcast.net wrote:
> > BLUF: There exists the possibility of data loss when performing
> > DataNode decommissioning with Accumulo running. This note applies to
> > installations of Accumulo 1.5.0+ and Hadoop 2.5.0+.
> >
> > DETAILS: During DataNode decommissioning it is possible for the
> > NameNode to report stale block locations (HDFS-8208). If Accumulo is
> > running during this process, then it is possible that files currently
> > being written will not close properly. Accumulo is affected in two ways:
> >
> > 1. During compactions, temporary rfiles are created, closed, and then
> > renamed. If a failure happens during the close, the compaction will fail.
> > 2. Write-ahead log files are created, written to, and then closed. If
> > a failure happens during the close, then the NameNode will have a walog
> > file with no finalized blocks.
> >
> > If either of these cases happens, decommissioning of the DataNode
> > could hang (HDFS-3599, HDFS-5579) because the files are left in an
> > open-for-write state. If Accumulo needs the write-ahead log for
> > recovery, it will be unable to read the file and will not recover.
> >
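> > For example, files stuck in this open-for-write state can be listed
> > with fsck (a sketch: /accumulo is just an example instance directory,
> > and the walog path below is a placeholder for your own):
> >
> >   # list any files under the Accumulo directory still open for write
> >   hdfs fsck /accumulo -openforwrite -files
> >
> >   # lease recovery can sometimes release a stuck walog ("hdfs debug"
> >   # ships with Hadoop 2.7+; older versions would need the
> >   # DistributedFileSystem#recoverLease API instead)
> >   hdfs debug recoverLease -path /accumulo/wal/<tserver>/<walog-uuid>
> >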
> > RECOMMENDATION: Assuming that the replication pipeline for the
> > write-ahead log is working properly, you should not run into this
> > issue if you only decommission one rack at a time.
> >
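> > Concretely, one-rack-at-a-time decommissioning via the standard
> > exclude-file mechanism might look like this (a sketch: the excludes
> > path and hostnames are placeholders, and dfs.hosts.exclude must
> > already point at the file in hdfs-site.xml):
> >
> >   # add only the DataNodes from a single rack to the excludes file
> >   echo "dn-rack1-01.example.com" >> /etc/hadoop/conf/dfs.exclude
> >   echo "dn-rack1-02.example.com" >> /etc/hadoop/conf/dfs.exclude
> >
> >   # tell the NameNode to re-read the include/exclude lists
> >   hdfs dfsadmin -refreshNodes
> >
> >   # wait until the nodes report "Decommissioned" before starting on
> >   # the next rack
> >   hdfs dfsadmin -report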
>
