Date: Tue, 05 May 2015 10:20:20 -0400
From: Josh Elser
To: user@accumulo.apache.org
Subject: Re: Unassigned, but not offline, tablets

These are the troubleshooting steps I take (writing them down as they may
eventually be more generally useful to people):

If there are continually unassigned tablets, it's _likely_ that tablets
need to have log-recovery performed and, for some reason, that isn't
happening.

1. Ensure that the Accumulo system tables (!METADATA in 1.5 and
accumulo.root and accumulo.metadata in >=1.6) are fully available. You
should be able to `scan -np -t` each of these tables without issue -- you
should be able to read the entire table and the scan command should not
hang. If you cannot, your problem is worst-case and you may want to
consider the procedure under Metadata File Corruption in [1].

2. If the system tables are OK, you can move to the assumption that the
unassigned tablets belong to a user table. `accumulo admin checkTablets`
may be of use.
You have two options at this point:

2a. Accept data loss. See the instructions at [1] on removing log entries
for tablets.
2b. Recover the corrupt data from HDFS (not covered here).

I've seen situations where tablets that fail recovery don't send their
logs to the Monitor. The master will likely have a record of the reason
the recovery failed; the tabletserver will definitely have a record. Check
the ends of the log files for both processes and you'll likely find an
Exception explaining why recovery keeps failing.

[1] http://accumulo.apache.org/1.6/accumulo_user_manual.html#_hdfs_failure

Bill Slacum wrote:
> After a catastrophic failure, the Master Server section of the monitor
> will report that there are 16 unassigned tablets (out of thousands), but
> no table shows any offline tablets.
>
> There were corrupt files under the recovery directory. These were
> removed.
>
> Otherwise, things seem fine with the cluster (we are having ingest
> processes hang, which may or may not be related).
>
> What should I do, as an operator, when Accumulo is in this state?
>
> I have no logs to provide, unfortunately.
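
For anyone who wants to script the two checks described in the reply (that a system table can be scanned end to end without hanging, and that the ends of the master and tabletserver logs mention an Exception), here is a rough sketch in bash. It is not from the original thread. The function names, the ACCUMULO_BIN, ACCUMULO_USER and ACCUMULO_PASSWORD environment variables, and the default time limit are all assumptions made for illustration. It relies on the `timeout` utility from GNU coreutils and on the Accumulo shell's `-u`, `-p` and `-e` options.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names, variables and defaults are illustrative assumptions.

# Scan one table with no pagination and report whether the scan finished
# within the time limit (seconds). A scan that hangs is killed by `timeout`
# and reported as a failure.
scan_completes() {
  local table="$1" limit="${2:-300}"
  if timeout "$limit" "${ACCUMULO_BIN:-accumulo}" shell \
       -u "${ACCUMULO_USER:-root}" -p "$ACCUMULO_PASSWORD" \
       -e "scan -np -t $table" > /dev/null 2>&1; then
    echo "OK: $table"
  else
    echo "FAILED or timed out: $table"
    return 1
  fi
}

# Print every line mentioning "Exception" found in the last N lines of each
# readable log file, prefixed with the file name.
recent_exceptions() {
  local lines="$1"; shift
  local f
  for f in "$@"; do
    [ -r "$f" ] || continue
    tail -n "$lines" "$f" | awk -v f="$f" '/Exception/ { print f ": " $0 }'
  done
  return 0
}

# Example usage (table names are for >=1.6; log paths are placeholders that
# depend on your installation):
#   scan_completes accumulo.root
#   scan_completes accumulo.metadata
#   recent_exceptions 200 /path/to/master.log /path/to/tserver.log
```

A failed or timed-out scan of a system table points at the worst case from step 1. If both scans succeed, the Exception lines from the log tails are the place to look for why recovery keeps failing.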