From: ecn@apache.org
To: commits@accumulo.apache.org
Reply-To: dev@accumulo.apache.org
Date: Thu, 06 Mar 2014 15:28:27 -0000
Subject: [1/2] git commit: ACCUMULO-1220 added some advanced recovery options

Repository: accumulo
Updated Branches:
  refs/heads/master ffb22b880 -> 384bc842d


ACCUMULO-1220 added some advanced recovery options


Project: http://git-wip-us.apache.org/repos/asf/accumulo/repo
Commit: http://git-wip-us.apache.org/repos/asf/accumulo/commit/75c3f28d
Tree: http://git-wip-us.apache.org/repos/asf/accumulo/tree/75c3f28d
Diff: http://git-wip-us.apache.org/repos/asf/accumulo/diff/75c3f28d

Branch: refs/heads/master
Commit: 75c3f28dddf1d8df63658b4b770fc8c34806eb3c
Parents: 657a4e5
Author: Eric Newton
Authored: Thu Mar 6 10:27:54 2014 -0500
Committer: Eric Newton
Committed: Thu Mar 6 10:27:54 2014 -0500

----------------------------------------------------------------------
 .../chapters/troubleshooting.tex                | 87 +++++++++++++++++++-
 1 file changed, 83 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/accumulo/blob/75c3f28d/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
----------------------------------------------------------------------
diff --git a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
index 91fb156..8ba7176 100644
--- a/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
+++ b/docs/src/main/latex/accumulo_user_manual/chapters/troubleshooting.tex
@@ -140,8 +140,8 @@ Zookeeper processes talk to each other to elect a leader. All updates
 go through the leader and propagate to a majority of all the other
 nodes. If a majority of the nodes cannot be reached, zookeeper will
 not allow updates. Zookeeper also limits the number connections to a
-server from any other single host. By default, this limit is 10, and
-can be reached in some everything-on-one-machine test configurations.
+server from any other single host. By default, this limit can be as small as 10
+and can be reached in some everything-on-one-machine test configurations.
 
 You can check the election status and connection status of clients by
 asking the zookeeper nodes for their status. You connect to zookeeper
@@ -200,7 +200,7 @@ access this memory, the OS will begin flushing disk buffers to return that
 memory to the VM. This can cause the entire process to block long
 enough for the zookeeper lock to be lost.
 
-A. Configure your system to reduce the kernel parameter ``swappiness'' from the default (30) to zero.
+A. Configure your system to reduce the kernel parameter ``swappiness'' from the default (60) to zero.
 
 Q. My tablet server lost its lock, and I have already set swappiness to zero. Why?
 
@@ -447,6 +447,7 @@ INFO : Using ZooKeepers localhost:2181
 \normalsize
 
 \section{System Metadata Tables}
+\label{sec:metadata}
 
 Accumulo tracks information about tables in metadata tables. The metadata for
 most tables is contained within the metadata table in the accumulo namespace,
@@ -517,6 +518,84 @@ Besides these columns, you may see:
 
 \end{enumerate}
 
+\section{Advanced System Recovery}
 
-\section{}
 
+Q. I had a disastrous HDFS failure. After bringing everything back up, several tablets refuse to go online.
+Data written to tablets is written into memory before being written into indexed files. In case the server
+is lost before the data is saved into an indexed file, all data stored in memory is first written into a
+write-ahead log (WAL). When a tablet is re-assigned to a new tablet server, the write-ahead logs are read to
+recover any mutations that were in memory when the tablet was last hosted.
+
+If a write-ahead log cannot be read, then the tablet is not re-assigned. All it takes is for one of
+the blocks in the write-ahead log to be missing. This is unlikely unless multiple data nodes in HDFS have been
+lost.
+
+A. Get the WAL files online and healthy. Restore any data nodes that may be down.
+
+Q. How do I find out which tablets are offline?
+
+A. Use ``accumulo admin checkTablets''
+
+\small
+\begin{verbatim}
+ $ bin/accumulo admin checkTablets
+\end{verbatim}
+\normalsize
+
+Q. I lost three data nodes, and I'm missing blocks in a WAL. I don't care about data loss, how
+can I get those tablets online?
+
+See the discussion in section~\ref{sec:metadata}, which shows a typical metadata table listing.
+The entries with a column family of ``log'' are references to the WAL for that tablet.
+If you know which WAL is bad, you can find all the references with a grep in the shell:
+
+\small
+\begin{verbatim}
+shell> grep 0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995
+3< log:127.0.0.1+9997/0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995 [] 127.0.0.1+9997/0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995|6
+\end{verbatim}
+\normalsize
+
+A. You can remove the WAL references in the metadata table.
+
+\small
+\begin{verbatim}
+shell> grant -u root Table.WRITE -t accumulo.metadata
+shell> delete 3< log 127.0.0.1+9997/0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995
+\end{verbatim}
+\normalsize
+
+Note: the colon (``:'') is omitted when specifying the ``row cf cq'' for the delete command.
+
+The master will automatically discover the tablet no longer has a bad WAL reference and will
+assign the tablet. You will need to remove the reference from all the tablets to get them
+online.
+
+
+Q. The metadata (or root) table has references to a corrupt WAL.
+
+This is a much more serious state, since losing updates to the metadata table will result
+in references to old files which may not exist, or lost references to new files, resulting
+in tablets that cannot be read, or large amounts of data loss.
+
+The best hope is to restore the WAL by fixing HDFS data nodes and bringing the data back online.
+If this is not possible, the best approach is to re-create the instance and bulk import all files from
+the old instance into new tables.
+
+A complete set of instructions for doing this is outside the scope of this guide,
+but the basic approach is:
+
+\begin{itemize}
+ \item Use ``tables -l'' in the shell to discover the table name to table id mapping
+ \item Stop all accumulo processes on all nodes
+ \item Move the accumulo directory in HDFS out of the way:
+\small
+\begin{verbatim}
+ $ hadoop fs -mv /accumulo /corrupt
+\end{verbatim}
+\normalsize
+ \item Re-initialize accumulo
+ \item Recreate tables, users and permissions
+ \item Import the directories under \texttt{/corrupt/tables/} into the new instance
+\end{itemize}
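
Editor's illustration (not part of the commit): the patch notes that the colon is omitted when giving ``row cf cq'' to the shell's delete command. When a bad WAL is referenced by many tablets, turning each grep result line into the matching delete command by hand is error-prone. The Python sketch below shows one way to do that conversion. The helper name delete_command is hypothetical and is not part of Accumulo, and the sketch assumes the row and the column contain no whitespace, as in the example output in the patch.

```python
def delete_command(grep_line: str) -> str:
    """Turn one metadata grep result line into a shell `delete` command.

    Hypothetical helper for illustration; not part of Accumulo.
    Assumes neither the row nor the column contains whitespace.
    """
    # Shell output has the form: "<row> <cf>:<cq> [<visibility>] <value>"
    row, column = grep_line.split()[:2]
    # Split on the first colon only, in case the qualifier contains more.
    family, qualifier = column.split(":", 1)
    if family != "log":
        raise ValueError(f"not a WAL reference: {grep_line!r}")
    # The delete command takes "row cf cq" separated by spaces, no colon.
    return f"delete {row} {family} {qualifier}"


line = ("3< log:127.0.0.1+9997/0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995 [] "
        "127.0.0.1+9997/0cb7ce52-ac46-4bf7-ae1d-acdcfaa97995|6")
print(delete_command(line))
```

For the example line from the patch, this prints the same delete command that the patch shows. The generated commands would still be run in the shell by hand, after granting Table.WRITE on accumulo.metadata as described in the patch.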