Subject: Re: Adding a tablet to a tserver
From: Kina Winoto <winoto.kina.s@gmail.com>
To: user@accumulo.apache.org
Date: Tue, 24 Feb 2015 15:21:15 -0800

Thanks Josh. That was helpful; yes, a migration to Hadoop 2 is in our future!

In the end, I decided to start a new instance like you ended up suggesting
and bulk importing.

Thanks for the help!

On Tue, Feb 24, 2015 at 1:33 PM, Josh Elser wrote:

> Ok, that helps a bit. A few things:
>
> > "Could not create ServerSocket.." error as it can't connect to the
> > tserver.
>
> Note that this is a server socket. This means that the server (master or
> tabletserver) failed to bind the socket it was going to use for the Thrift
> server. This means that Accumulo will not work, as the processes can't
> communicate with each other or with clients. The error message should make
> it fairly obvious why the exception was thrown. Hopefully, the process
> killed itself too.
>
> > Hadoop 1.2.1
>
> Hadoop 1 doesn't have the best track record when it comes to ensuring that
> a file is actually written to disk when we request it to be (a big part of
> the reason we suggest moving to Hadoop 2 when you can). A hard poweroff can
> result in bad Accumulo files in HDFS.
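For readers landing on this thread later: the two HDFS settings Josh discusses both live in hdfs-site.xml. A hypothetical fragment follows, assuming a Hadoop 1.2-style configuration; the reserved-space property is presumably `dfs.datanode.du.reserved` (Josh doesn't name it), and the 1 GB value is a placeholder, not a recommendation:

```xml
<!-- hdfs-site.xml (illustrative fragment; values are example placeholders) -->
<configuration>
  <!-- fsync file contents on close, reducing the chance of bad
       Accumulo files in HDFS after a hard poweroff -->
  <property>
    <name>dfs.datanode.synconclose</name>
    <value>true</value>
  </property>
  <!-- bytes reserved per volume for non-DFS use, so HDFS refuses
       writes before the underlying disk is actually full -->
  <property>
    <name>dfs.datanode.du.reserved</name>
    <value>1073741824</value>
  </property>
</configuration>
```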
> You can try adding dfs.datanode.synconclose=true to your hdfs-site.xml,
> which might help protect against this, but I'm not sure of the error
> handling of actually running out of space on the local disk. HDFS'
> reserved-space configuration can help remove this worry by preventing
> writes when HDFS is nearing full, instead of when the actual file system
> is.
>
> > I deleted the wal logs, hoping that it would revert to what was in
> > /accumulo/tables
>
> Deleting the WALs also isn't doing what you expect it to :). The WALs,
> especially for the metadata table, are extremely important and are needed
> to ensure that data is not lost (if WALs for the metadata table are lost,
> the table might be in an inconsistent state that Accumulo can't
> automatically recover from).
>
> This is probably why your tables are not coming online.
>
> Recovering your existing instance might not be worth the hassle. It's
> likely easier to just move the RFiles in HDFS out of the way, and then
> reimport them into a reinitialized Accumulo.
>
> An outline of how to do this can be found at
> http://accumulo.apache.org/1.6/accumulo_user_manual.html#_hdfs_failure
> under the *Q* "The metadata (or root) table has references to a corrupt
> WAL". If you need some more guidance than what is listed there, please
> feel free to ask!
>
> Kina Winoto wrote:
>
>> Hi Josh,
>>
>> > Versions of Hadoop and Accumulo:
>> Hadoop 1.2.1
>> Accumulo 1.6.1
>>
>> > Are the accumulo.metadata/!METADATA and/or accumulo.root tables online?
>> Nope. I tried to scan the tables -- it just hangs.
>>
>> > Have you checked the logs of the Master and/or TabletServer for any
>> > exceptions?
>> The master log is locked for a read operation (an info message). I tried
>> to shut down the master with accumulo admin -f stopMaster, but it's still
>> unhappy.
>> The tserver log doesn't have any exceptions. However, if I run accumulo
>> tserver -a localhost, then I'll get a "Could not create ServerSocket.."
>> error as it can't connect to the tserver.
>>
>> For more context, I ran into all of this because I'm running this on a
>> VM and I ran out of disk space, so Accumulo could no longer write to the
>> WAL reliably and then checksums weren't matching up. After I created
>> more space on my VM, I deleted the WAL logs, hoping that it would revert
>> to what was in /accumulo/tables, but then ran into this error where I
>> have zero tablets.
>>
>> Thanks for any suggestions on what to do next!
>>
>> - Kina
>>
>> On Tue, Feb 24, 2015 at 11:13 AM, Josh Elser <josh.elser@gmail.com> wrote:
>>
>>     Hi Kina,
>>
>>     Can you share some more information?
>>
>>     * Versions of Hadoop and Accumulo
>>     * Are the accumulo.metadata/!METADATA and/or accumulo.root tables
>>       online?
>>     * Have you checked the logs of the Master and/or TabletServer for
>>       any exceptions?
>>
>>     - Josh
>>
>>     Kina Winoto wrote:
>>
>>         Hi,
>>
>>         I'm running a local instance of Accumulo with just one tablet
>>         server. I got into a rut and now I don't have any tablets. There
>>         is data still in HDFS, but I assume the data is corrupted, so
>>         the tablets aren't being assigned to the tablet server. Is there
>>         a way I can force a tablet to be assigned? I don't mind giving
>>         up a portion of my data (or all of it) at this point. I'd just
>>         rather not have to reinitialize Accumulo and recreate all the
>>         users and set up all my tables again. Maybe I can force a tablet
>>         assignment and then delete the tables that are corrupted?
>>
>>         I've encountered a similar issue on a many-node cluster and
>>         would like to know if my only option is to reinitialize
>>         Accumulo.
>>
>>         Thanks!
>>
>>         - Kina
>>
>>         Sent from Mailbox <https://www.dropbox.com/mailbox>
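The move-RFiles-aside-and-reimport recovery Josh points to (and the reinitialize-and-bulk-import route Kina ultimately took) goes roughly like the sketch below. This is only a hedged outline, not the manual's exact procedure: the HDFS paths, table name, and table id are hypothetical, and `importdirectory` expects the staged files to be valid RFiles for the table currently selected in the shell.

```
# Sketch only -- all paths and names are hypothetical examples.

# 1. Move the damaged instance's files out of the way in HDFS:
hadoop fs -mv /accumulo /accumulo-corrupt

# 2. Reinitialize Accumulo (creates a fresh instance; users and
#    table configuration must be recreated by hand):
accumulo init

# 3. Stage the RFiles of one old table (hypothetical table id "2")
#    into an import directory, plus an empty failure directory:
hadoop fs -mkdir /tmp/bulk /tmp/bulk-failures
hadoop fs -cp '/accumulo-corrupt/tables/2/*/*.rf' /tmp/bulk

# 4. In the Accumulo shell, create the table and bulk import
#    (the trailing "false" means "do not set entry timestamps"):
#      createtable mytable
#      table mytable
#      importdirectory /tmp/bulk /tmp/bulk-failures false
```

Repeat step 3-4 per table; the mapping from table name to table id in the old instance can be recovered from the old instance's records before it is torn down.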