accumulo-user mailing list archives

From "Ott, Charles H." <CHARLES.H....@saic.com>
Subject RE: Dead Tablet Server
Date Tue, 17 Sep 2013 19:32:12 GMT
 

 

From: user-return-3080-CHARLES.H.OTT=saic.com@accumulo.apache.org
[mailto:user-return-3080-CHARLES.H.OTT=saic.com@accumulo.apache.org] On
Behalf Of Keith Turner
Sent: Tuesday, September 17, 2013 3:20 PM
To: user@accumulo.apache.org
Subject: Re: Dead Tablet Server

 

 

 

On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H.
<CHARLES.H.OTT@saic.com> wrote:

Forgive my ignorance with this, but I have not yet had a tablet failure
that I have been able to recover from without restarting the entire
Accumulo cluster.

 

I have 3 tablet servers: 2 online, 1 dead.  I am using Accumulo 1.4.3.

 

The tablet error reports:

Uncaught exception in TabletServer.main, exiting

         java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
                 at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
                 at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
                 at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
                 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
                 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
                 at java.lang.reflect.Method.invoke(Method.java:597)
                 at org.apache.accumulo.start.Main$1.run(Main.java:89)
                 at java.lang.Thread.run(Thread.java:662)
         Caused by: java.lang.RuntimeException: Too many retries, exiting.
                 at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
                 ... 8 more

 

 

 

It would be nice to add this stack trace as a comment on ACCUMULO-1277
to make it easier to find via Google.  Would you like to do this?  If
not, I can.

 

                I just added it to the comments:
https://issues.apache.org/jira/browse/ACCUMULO-1277

	The recovery portion of the Admin guide says that recovery is
performed by asking the loggers to copy their write-ahead logs into
HDFS.  The logs are copied, sorted and then tablets can find missing
updates.  Once complete the tablets involved should return to an
'online' state.

	 

	I am not sure how to ask the loggers to copy their write-ahead
logs into HDFS.  Is this the same as using the flush shell command?  If
so, the flush command needs a table name or a pattern of table names.
Would I want to run something like 'accumulo flush -p .+' to flush all
of the table data to HDFS?

	 

	Another concern is that the Tablet Server process was no longer
running on the server.  I logged into that server and ran
"start-here.sh".  The tablet server is now running, but the monitor
still reports it as 'dead'.

	 

	Thanks in advance,

	Charles

 

