From "Dickson, Matt MR" <>
Date Tue, 17 Dec 2019 04:49:23 GMT

After running some test queries on the metadata table, I noticed that all the problem records referenced
WALs for one specific server in the HDFS path.  I moved that to temp, replaced it with an
empty file, and then restarted Accumulo. It has come back and is running correctly.

I did also test deleting a few of the log rows in the metadata table and it didn't create
any issues.  I was going to move to that approach if the wal file move failed but fortunately
didn't need to.

Thanks for the quick response to my issue, Ed. It was a massive help.

Merry Christmas

From: Ed Coleman <>
Sent: Tuesday, 17 December 2019 2:55 PM

Also, the number of entries per row that you will need to delete should match the command
output you first posted:

4d4;blah::words::4gfv43@(host:9997[23423442344234f23fd],null,null_ is ASSIGNED_TO_DEAD_SERVER

So, yes, for that row you'd need to delete two entries.

Ed Coleman

From: Ed Coleman []
Sent: Monday, December 16, 2019 10:39 PM

We're dealing with a strange definition of "safe" here - but that would be the intention.
For tablets whose assigned WALs cannot be recovered, and where you just want to move on,
deleting those entries will get the tablets to stop trying to recover the data and
move on.

You could check to see if the referenced files exist and if they have any size - that would
give you an idea of what magnitude of loss you will be dealing with.
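
A minimal Python sketch of that existence-and-size check, assuming the `hdfs` command is on the PATH. It relies on `hdfs dfs -test -e` (exit code 0 if the path exists) and `hdfs dfs -test -z` (exit code 0 if the file is zero length). The function name `wal_status` and its `run` parameter are made up for illustration, and the path you pass in would come from your own metadata scan.

```python
import subprocess


def wal_status(path, run=subprocess.run):
    """Classify one HDFS path as 'missing', 'empty', or 'has data'.

    `run` is injectable (an assumption of this sketch) so the logic can be
    exercised without a Hadoop client installed.
    """
    # `hdfs dfs -test -e` returns exit code 0 when the path exists.
    if run(["hdfs", "dfs", "-test", "-e", path]).returncode != 0:
        return "missing"
    # `hdfs dfs -test -z` returns exit code 0 when the file is zero length.
    if run(["hdfs", "dfs", "-test", "-z", path]).returncode == 0:
        return "empty"
    return "has data"


# Example call (needs a working hdfs client; the path is a placeholder):
# print(wal_status("hdfs://.../accumulo/wal/..."))
```

A "missing" or "empty" result suggests there is nothing to lose for that reference, while "has data" marks a WAL whose contents would be discarded.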

Once everything is back, you may want to consider a full compaction as a verification and to root
out any other issues.

Ed Coleman

From: Dickson, Matt MR []
Sent: Monday, December 16, 2019 10:31 PM

So when scanning the metadata table I get a bunch of rows. One has a column family of log:.....
and this one references the WALs.  There are actually two rows with separate WALs listed.
4d4;blah  log:host_1:997/hdfs//.../accumulo/wal/.../111111111111111   []   hdfs://.../accumulo/....
4d4;blah  log:host_1:997/hdfs//.../accumulo/wal/.../222222222222222   []   hdfs://.../accumulo/....

Is it safe to remove both of these rows?

Looking at other 'good' tablets they have no wal row.

From: Ed Coleman <<>>
Sent: Tuesday, 17 December 2019 1:53 PM

This is off the top of my head and I don't have a local instance running to try anything to
see if that would knock a few brain cells into recalling... so.

You don't say what version, but there used to be an issue that if multiple tservers operated
on a walog, the failed entry would cause the others to also fail.  You might be able to remove
the hdfs://system/accumulo/recovery/434adfsdf124312f and the system will retry.

The other ways - as long as you really don't care what may be in the wal logs.

Most certain, but it also needs to be done carefully, is to remove the wal references in the metadata
table.  One way is to use the shell to scan the metadata table for the tablet id range and
pipe it to a file.  Then, with grep / awk, select the wal entries and turn them into a deleterow
command - one catch is that the printed version of the row id is like id...col_fam:col_qual - and
the delete command does not like the : (or it's the other way round...).  And you may need
to grant delete privileges to the metadata table...

scan -t accumulo.metadata -b 4d4; -e 4d4~ -np
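
As a rough illustration of the scan-then-delete idea, here is a minimal Python sketch (Python instead of the grep / awk mentioned, purely for readability). It assumes the `-np` scan output was saved one entry per line in the form `row family:qualifier [visibility] value`, that row ids contain no whitespace, and that the shell's `delete <row> <colfamily> <colqualifier>` form accepts double-quoted arguments. The function name is made up, and the sample lines are copied from elsewhere in this thread.

```python
def log_entries_to_deletes(scan_lines):
    """Turn saved scan lines for WAL ('log') entries into shell delete commands."""
    commands = []
    for line in scan_lines:
        fields = line.split()
        # Keep only entries whose column family is 'log'; skip everything else.
        if len(fields) < 2 or not fields[1].startswith("log:"):
            continue
        row = fields[0]
        # Split on the FIRST ':' only, so the host:port and path stay
        # together in the qualifier instead of being cut at every colon.
        family, qualifier = fields[1].split(":", 1)
        commands.append(f'delete "{row}" "{family}" "{qualifier}"')
    return commands


# Sample lines as they appear in this thread (elided paths left as-is).
sample = [
    "4d4;blah  log:host_1:997/hdfs//.../accumulo/wal/.../111111111111111   []   hdfs://.../accumulo/....",
    "4d4;blah  log:host_1:997/hdfs//.../accumulo/wal/.../222222222222222   []   hdfs://.../accumulo/....",
    "~err_zxn  TABLET_LOAD: ............",
]

for command in log_entries_to_deletes(sample):
    print(command)
```

Review the generated commands by hand before running any of them in the shell against accumulo.metadata, since a wrong delete there is hard to undo.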

The other way would be to identify the "missing" files and put an empty file into HDFS in
its place.
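
The empty-file approach could be sketched like this in Python. The sketch only builds the command lines and runs nothing: `hdfs dfs -mv` sets any existing file aside and `hdfs dfs -touchz` creates a zero-length file at the original path. The function name, the backup directory, and the example path in the comment are hypothetical.

```python
import posixpath


def empty_file_swap_commands(wal_path, backup_dir):
    """Build the hdfs commands that swap one WAL for an empty file."""
    # Keep the original file name when moving it into the backup directory.
    backup = posixpath.join(backup_dir, posixpath.basename(wal_path))
    return [
        # Set the original aside (skip this step if the file is already gone).
        ["hdfs", "dfs", "-mv", wal_path, backup],
        # Create a zero-length file where the WAL used to be.
        ["hdfs", "dfs", "-touchz", wal_path],
    ]


# Example with made-up paths:
# for cmd in empty_file_swap_commands("/accumulo/wal/host/abc", "/tmp/wal-backup"):
#     print(" ".join(cmd))
```

Printing the commands first and running them by hand keeps a person in the loop for each file that gets replaced.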

Or, if you can take the table offline, then you can stash the current files, create a
new table, and bulk import them (set the splits, or do it in batches).  Not sure if delete works
on offline tables - that's one thing to try first.

Ed Coleman

From: Dickson, Matt MR []
Sent: Monday, December 16, 2019 8:53 PM

A correction to my description:

In the 'Table Problems' section of the Accumulo GUI there are 8K errors stating:

Table   Problem Type   Server   Time       Resource       Exception
Table   TABLET_LOAD    host     datetime   resourceUUID   ....FileNotFoundException: File does not exist: hdfs://system/accumulo/recovery/434adfsdf124312f/failed/data

These seem to correspond to records in the accumulo.metadata table:

scan -t accumulo.metadata -b ~err

~err_zxn  TABLET_LOAD: ............

From: Dickson, Matt MR <<>>
Sent: Tuesday, 17 December 2019 12:29 PM



I'm trying to recover from an issue caused by table.split.threshold being set
to a very low size. That generated a massive load on ZooKeeper, and cluster nodes timed
out communicating with ZooKeeper while Accumulo was splitting tablets.  This was noticed when
tablet servers were being declared dead.

I've corrected the threshold and Accumulo is back online; however, there are 8K unhosted tablets
that are not coming online.

Running the checkTablets script produces exactly as many errors as there are unhosted
tablets, with a message like:

4d4;blah::words::4gfv43@(host:9997[23423442344234f23fd],null,null_ is ASSIGNED_TO_DEAD_SERVER

I'm not concerned if data in these tablets is lost in returning the system
to a healthy state. I suspect other Accumulo operations can't proceed while tablets
are unhosted, so I just need to remove these issues.

Any advice would be great.

Thanks in advance,

