accumulo-user mailing list archives

From "Ed Coleman" <d...@etcoleman.com>
Subject RE: ASSIGNED_TO_DEAD_SERVER #walogs:2 [SEC=UNOFFICIAL]
Date Tue, 17 Dec 2019 03:39:18 GMT
We're dealing with a strange definition of "safe" here, but that would be
the intention.  For tablets with assigned WALs that cannot be recovered,
where you just want to move on, deleting those entries will stop the
tablets from trying to recover that data and let them move on.

 

You could check to see if the referenced files exist and if they have any
size - that would give you an idea of what magnitude of loss you will be
dealing with.
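
As a minimal sketch of that check: list the referenced WAL files with
`hdfs dfs -ls <path> ...`, save the output, and summarize it. The function
name and the sample paths are made up for illustration. The parser assumes
the usual eight-column `-ls` listing (size in the fifth column, path last),
paths without spaces, and paths written the same way in both inputs.

```python
def summarize_wal_listing(expected_paths, ls_output):
    """Split expected WAL paths into missing, empty, and non-empty.

    ls_output is the saved text of `hdfs dfs -ls <path> ...`.
    Assumed columns: perms, replication, owner, group, size, date, time, path.
    """
    sizes = {}
    for line in ls_output.splitlines():
        fields = line.split()
        # Skip "Found N items" headers, error lines, and anything unexpected.
        if len(fields) != 8 or not fields[4].isdigit():
            continue
        sizes[fields[7]] = int(fields[4])

    missing = [p for p in expected_paths if p not in sizes]
    empty = [p for p in expected_paths if sizes.get(p) == 0]
    non_empty = {p: sizes[p] for p in expected_paths if sizes.get(p, 0) > 0}
    return missing, empty, non_empty
```

The total of the non-empty sizes gives a rough upper bound on how much
unrecovered data is being given up.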

 

Once everything is back, you may want to consider a full compaction as a
verification step and to root out any other issues.

 

Ed Coleman

 

From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au] 
Sent: Monday, December 16, 2019 10:31 PM
To: user@accumulo.apache.org
Subject: RE: ASSIGNED_TO_DEAD_SERVER #walogs:2 [SEC=UNOFFICIAL]

 

UNOFFICIAL

So when scanning the metadata table I get a bunch of rows. One has a column
family of log:...  and this one references the WALs.  There are actually
two rows with separate WALs listed.

Ie:

4d4;blah  log:host_1:997/hdfs//./accumulo/wal/./111111111111111   []
hdfs://./accumulo/..

4d4;blah  log:host_1:997/hdfs//./accumulo/wal/./222222222222222   []
hdfs://./accumulo/..

 

Is it safe to remove both of these rows?   

 

Looking at other 'good' tablets they have no wal row.

 

From: Ed Coleman <dev1@etcoleman.com <mailto:dev1@etcoleman.com> > 
Sent: Tuesday, 17 December 2019 1:53 PM
To: user@accumulo.apache.org <mailto:user@accumulo.apache.org> 
Subject: RE: ASSIGNED_TO_DEAD_SERVER #walogs:2 [SEC=UNOFFICIAL]

 

This is off the top of my head, and I don't have a local instance running to
try anything that might knock a few brain cells into recalling, so...

 

You don't say what version, but there used to be an issue that if multiple
tservers operated on a walog, the failed entry would cause the others to
also fail.  You might be able to remove the
hdfs://system/accumulo/recovery/434adfsdf124312f and the system will retry.
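
With thousands of problem entries it may help to collect the distinct
recovery directories first. A minimal sketch, assuming the problem text has
been saved to a file and that each message contains a path of the form
`hdfs://.../recovery/<id>/...`, as in the "File does not exist" error quoted
further down this thread. The function name is hypothetical.

```python
import re

# Matches "hdfs://<anything>/recovery/<id>" and stops before the next "/".
RECOVERY_DIR = re.compile(r"(hdfs://\S*?/recovery/[^/\s]+)")


def recovery_dirs(problem_text):
    """Return the distinct recovery directories found in problem_text, sorted."""
    return sorted(set(RECOVERY_DIR.findall(problem_text)))
```

The result is only a list of candidates to inspect before anything is
removed.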

 

The other ways apply only as long as you really don't care what may be in
the WALs.

 

The most certain way, though it needs to be done carefully, is to remove the
WAL references from the metadata table.  One way is to use the shell to scan
the metadata table for the tablet id range and pipe the output to a file.
Then, with grep / awk, select the WAL entries and turn them into delete
commands.  One catch is that the printed version of the row id is like
id.col_fam:col_qual, and the delete command does not like the : (or it's
the other way round).  You may also need to grant delete privileges on the
metadata table.

 

scan -t accumulo.metadata -b 4d4; -e 4d4~ -np
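
The grep / awk step could equally be done in a few lines of Python. This is
a sketch under assumptions, not a tested procedure: it assumes each scan
line starts with `row colfam:colqual`, that the row contains no whitespace,
and that the shell accepts double-quoted arguments with backslash escapes.
The function name is hypothetical.

```python
def wal_delete_commands(scan_output, table="accumulo.metadata"):
    """Turn `log` entries from a saved metadata scan into shell commands."""

    def quote(arg):
        # Assumption: the Accumulo shell accepts double-quoted arguments
        # with backslash escapes for embedded quotes and backslashes.
        return '"' + arg.replace("\\", "\\\\").replace('"', '\\"') + '"'

    commands = []
    for line in scan_output.splitlines():
        fields = line.split()
        # Keep only WAL references; skip other columns and wrapped value lines.
        if len(fields) < 2 or not fields[1].startswith("log:"):
            continue
        row = fields[0]
        # The qualifier itself contains colons, so split on the first one only.
        family, qualifier = fields[1].split(":", 1)
        commands.append(
            "delete %s %s %s" % (quote(row), quote(family), quote(qualifier))
        )
    if commands:
        # Switch to the metadata table before issuing the deletes.
        commands.insert(0, "table " + table)
    return commands
```

Splitting the column family from the qualifier at the first colon sidesteps
the catch with the printed key format. The generated commands should be
reviewed by hand before any of them are run.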

 

The other way would be to identify the "missing" files and put an empty file
into hdfs in place of each one.
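
A small sketch of generating those placeholder commands, assuming the list
of missing paths is already known and that `hdfs dfs -touchz` (which creates
a zero-length file) is the tool used. Whether recovery accepts an empty file
in place of a WAL is untested here. The function name is hypothetical.

```python
import shlex


def placeholder_commands(missing_paths):
    """Build one `hdfs dfs -touchz` command per missing path."""
    # shlex.quote protects paths containing spaces or shell metacharacters.
    return ["hdfs dfs -touchz " + shlex.quote(p) for p in missing_paths]
```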

 

Or, if you can take the table offline, you can stash the current files, then
create a new table and bulk import them (set the splits, or do it in
batches).  Not sure if delete works on offline tables - that's one thing to
try first.

 

Ed Coleman

 

From: Dickson, Matt MR [mailto:matt.dickson@defence.gov.au] 
Sent: Monday, December 16, 2019 8:53 PM
To: user@accumulo.apache.org <mailto:user@accumulo.apache.org> 
Subject: RE: ASSIGNED_TO_DEAD_SERVER #walogs:2 [SEC=UNOFFICIAL]

 

UNOFFICIAL

A correction to my description:

 

In the 'Table Problems' section of the Accumulo GUI there are 8K errors
stating:

 

Table:         Table
Problem Type:  TABLET_LOAD
Server:        host
Time:          datetime
Resource:      resourceUUID
Exception:     java.io.IOException: ..FileNotFoundException: File does not
               exist:
               hdfs://system/accumulo/recovery/434adfsdf124312f/failed/data

 

These seem to correspond to records in the accumulo.metadata table:

 

scan -t accumulo.metadata -b ~err

 

~err_zxn  TABLET_LOAD: ....

 

From: Dickson, Matt MR <matt.dickson@defence.gov.au
<mailto:matt.dickson@defence.gov.au> > 
Sent: Tuesday, 17 December 2019 12:29 PM
To: user@accumulo.apache.org <mailto:user@accumulo.apache.org> 
Subject: ASSIGNED_TO_DEAD_SERVER #walogs:2 [SEC=UNOFFICIAL]

 

UNOFFICIAL

 

Hi,

 

I'm trying to recover from an issue caused by table.split.threshold being
set to a very low size.  While Accumulo was splitting tablets, this
generated a massive load on ZooKeeper, and cluster nodes timed out
communicating with ZooKeeper.  It was noticed when tablet servers were
being declared dead.

 

I've corrected the threshold and Accumulo is back online; however, there are
8K unhosted tablets that are not coming online.

 

Running the checkTablets script produces exactly as many errors as there are
unhosted tablets, each with a message like:

 

4d4;blah::words::4gfv43@(host:9997[23423442344234f23fd],null,null) is
ASSIGNED_TO_DEAD_SERVER #walogs:2

 

I'm not concerned if data in these tablets is lost in returning the system
to a healthy state.  I suspect other Accumulo operations can't proceed while
tablets are unhosted, so I just need to remove these issues.

 

Any advice would be great.

 

Thanks in advance,

Matt

