Mailing-List: contact user-help@accumulo.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@accumulo.apache.org
Message-ID: <5548D1A4.3030706@gmail.com>
Date: Tue, 05 May 2015 10:20:20 -0400
From: Josh Elser <josh.elser@gmail.com>
User-Agent: Postbox 3.0.11 (Macintosh/20140602)
MIME-Version: 1.0
To: user@accumulo.apache.org
Subject: Re: Unassigned, but not offline, tablets
References: <98AC900E-477D-4099-9757-D8002AAD862A@gmail.com>
In-Reply-To: <98AC900E-477D-4099-9757-D8002AAD862A@gmail.com>

These are the troubleshooting steps I take (writing them down as they may
eventually be more generally useful to people):

If there are continually unassigned tablets, it's _likely_ that tablets 
need to have log-recovery performed and, for some reason, that isn't 
happening.

1. Ensure that the Accumulo system tables (!METADATA in 1.5 and 
accumulo.root and accumulo.metadata in >=1.6) are fully available. You 
should be able to `scan -np -t <table>` these tables without issue -- 
you should be able to read the entire table and the scan command should 
not hang. If you cannot, you are in the worst case and may want to
consider the steps under Metadata File Corruption in [1].

2. If the system tables are OK, you can move to the assumption that the
unassigned tablets belong to a user table. `accumulo admin checkTablets`
may be of use. You have two options at this point:

2a. Accept data loss. See instructions at [1] on removing log entries 
for tablets.

2b. Recover the corrupt data from HDFS (not covered here).
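
The system-table check in step 1 can be scripted so that a hanging scan
is detected instead of waited on. This is only a minimal sketch. It
assumes GNU `timeout` is installed, that the `accumulo` launcher is on
the PATH, and that the Accumulo shell exits non-zero when the scan fails
(worth verifying on your version; if it does not, only the timeout case
is caught). The function names and the ACCUMULO_CMD, ACCUMULO_USER,
ACCUMULO_PASS and SCAN_TIMEOUT variables are made up for illustration.

```shell
# Scan one table end to end and discard the output. A non-zero status
# means the scan failed or did not finish within SCAN_TIMEOUT seconds.
# ACCUMULO_CMD, ACCUMULO_USER, ACCUMULO_PASS, SCAN_TIMEOUT: hypothetical names.
scan_table_fully() {
    timeout "${SCAN_TIMEOUT:-600}" \
        "${ACCUMULO_CMD:-accumulo}" shell \
        -u "${ACCUMULO_USER:-root}" -p "${ACCUMULO_PASS:-}" \
        -e "scan -np -t $1" > /dev/null
}

# Report, per table, whether a full scan completed.
check_system_tables() {
    for table in "$@"; do
        if scan_table_fully "$table"; then
            echo "OK: $table"
        else
            echo "PROBLEM: $table (scan failed or timed out)"
        fi
    done
}
```

On >=1.6 this would be called as
`check_system_tables accumulo.root accumulo.metadata`, and on 1.5 as
`check_system_tables '!METADATA'` (single-quoted so the shell leaves the
`!` alone). Passing the password with `-p` exposes it in the process
list, so treat that as a convenience for a one-off check only.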

I've seen situations where tablets that fail recovery don't send their
logs to the Monitor. The master will likely have a record of the reason
the recovery failed, and the tabletserver definitely will. Check the
ends of the log files for both processes and you'll likely find an
Exception explaining why recovery keeps failing.
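
Checking the ends of the logs can be done with something like the
following. It is a rough sketch, not a definitive recipe: the function
name and the TAIL_LINES variable are made up for illustration, and the
`Exception|ERROR` pattern is only a guess at what the interesting lines
look like.

```shell
# Print exception-looking lines from the last TAIL_LINES lines of each
# log file given as an argument. TAIL_LINES is a hypothetical knob.
find_recent_exceptions() {
    for log in "$@"; do
        echo "== $log"
        tail -n "${TAIL_LINES:-500}" "$log" \
            | grep -E 'Exception|ERROR' \
            || echo "(nothing matching in the last ${TAIL_LINES:-500} lines)"
    done
}
```

Pass it the master and tabletserver log files from wherever your
installation writes them, for example
`find_recent_exceptions /path/to/master.log /path/to/tserver.log`
(both paths are placeholders).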

[1] http://accumulo.apache.org/1.6/accumulo_user_manual.html#_hdfs_failure

Bill Slacum wrote:
> After a catastrophic failure, the Master Server section of the monitor
> will report that there are 16 unassigned tablets (out of thousands), but
> no table shows any offline tablets.
>
> There were corrupt files under the recovery directory. These were
> removed.
>
> Otherwise, things seem fine with the cluster (we are having ingest
> processes hang, which may or may not be related).
>
> What should I do, as an operator, when Accumulo is in this state?
>
> I have no logs to provide, unfortunately.