accumulo-user mailing list archives

From "Ott, Charles H." <CHARLES.H....@saic.com>
Subject RE: Dead Tablet Server
Date Tue, 17 Sep 2013 17:33:20 GMT
I have also verified that the firewalls are off on the ZooKeeper server and that maxClientCnxns is set to 0.

 

From: user-return-3074-CHARLES.H.OTT=saic.com@accumulo.apache.org [mailto:user-return-3074-CHARLES.H.OTT=saic.com@accumulo.apache.org]
On Behalf Of Ott, Charles H.
Sent: Tuesday, September 17, 2013 1:20 PM
To: user@accumulo.apache.org
Subject: RE: Dead Tablet Server

 

I tried restarting the tablet server and running tail -f on the logs folder.  I’m seeing this message
every 5 seconds:

 

==> tserver_1620-Node1.log <==
2013-09-17 13:11:24,836 [tabletserver.TabletServer] INFO : Waiting for tablet server lock

==> tserver_1620-Node1.debug.log <==
2013-09-17 13:11:29,861 [zookeeper.ZooLock] DEBUG: event /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.91:9997 NodeCreated SyncConnected
2013-09-17 13:11:29,886 [tabletserver.TabletServer] INFO : Waiting for tablet server lock

 

Of course, after a while it says ‘Too many retries’ and stops. I double-checked the znode /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers,
but again there was no entry for 10.35.56.91, only for .92 and .93.
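
To confirm the retry cadence rather than eyeballing tail output, the log lines can be checked with a few lines of Python. This is only a sketch: it assumes the timestamp layout shown in the excerpts above, and the function name wait_intervals is made up for illustration.

```python
from datetime import datetime

WAIT_MSG = "Waiting for tablet server lock"

def wait_intervals(lines):
    """Return the gaps, in seconds, between consecutive lock-wait messages."""
    stamps = []
    for line in lines:
        if WAIT_MSG in line:
            # The first two whitespace-separated fields are the date and time.
            stamp = " ".join(line.split()[:2])
            stamps.append(datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# Lines copied from the two log excerpts in this message.
log = [
    "2013-09-17 13:11:24,836 [tabletserver.TabletServer] INFO : Waiting for tablet server lock",
    "2013-09-17 13:11:29,861 [zookeeper.ZooLock] DEBUG: event "
    "/accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers/10.35.56.91:9997 NodeCreated SyncConnected",
    "2013-09-17 13:11:29,886 [tabletserver.TabletServer] INFO : Waiting for tablet server lock",
]

print(wait_intervals(log))
```

For the two messages quoted above, the gap comes out at roughly 5.05 seconds, which matches the 5-second retry described.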

 

Is there any reason why doing ./stop-all.sh and then ./start-all.sh would resolve this issue?
Last time this happened I was able to restart the entire cluster and all 3 nodes came back
online.  However, it’s only been about 2 days since this error last occurred, so I think
it would be best for me to fix the underlying issue rather than restarting over and over.

 

 

From: user-return-3070-CHARLES.H.OTT=saic.com@accumulo.apache.org [mailto:user-return-3070-CHARLES.H.OTT=saic.com@accumulo.apache.org]
On Behalf Of Ott, Charles H.
Sent: Tuesday, September 17, 2013 11:04 AM
To: user@accumulo.apache.org
Subject: RE: Dead Tablet Server

 

 

 

From: user-return-3068-CHARLES.H.OTT=saic.com@accumulo.apache.org [mailto:user-return-3068-CHARLES.H.OTT=saic.com@accumulo.apache.org]
On Behalf Of Josh Elser
Sent: Tuesday, September 17, 2013 10:39 AM
To: user@accumulo.apache.org
Subject: Re: Dead Tablet Server

 

 

On Tue, Sep 17, 2013 at 10:23 AM, Ott, Charles H. <CHARLES.H.OTT@saic.com> wrote:

Forgive my ignorance with this, but I have not yet had a tablet server failure that I have been able
to recover from without restarting the entire Accumulo cluster.

I have 3 tablet servers: 2 online, 1 dead.  I am using Accumulo 1.4.3.

The dead tablet server reports this error:

Uncaught exception in TabletServer.main, exiting
java.lang.RuntimeException: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2684)
        at org.apache.accumulo.server.tabletserver.TabletServer.run(TabletServer.java:2703)
        at org.apache.accumulo.server.tabletserver.TabletServer.main(TabletServer.java:3168)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.accumulo.start.Main$1.run(Main.java:89)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.RuntimeException: Too many retries, exiting.
        at org.apache.accumulo.server.tabletserver.TabletServer.announceExistence(TabletServer.java:2681)
        ... 8 more

 

Looking at the code, the tablet server couldn't obtain a lock for itself (using its IP:port).
I would start looking there. You could use zkCli.sh provided by ZooKeeper and look in /accumulo/${instance_id}/tservers/${ip}:${port}
to see if there is another server which already has the lock somehow.

 

                I logged into ZooKeeper using zkCli and checked the path you mentioned.

[zk: 1620-accumulo.dhcp.saic.com(CONNECTED) 2] ls /accumulo/6cbe5596-8803-46b2-8bba-79f5ceda599a/tservers

[10.35.56.92:9997, 10.35.56.93:9997]

 

There are only 2 servers listed there.  Both are the ‘online’ tablet servers that
are working okay.  So I guess there is no other server which already has the lock?  I believe the
dead tablet server is 10.35.56.91:9997, as my 3 servers’ IP addresses follow
the pattern x.x.x.91,92,93.
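
The comparison made by hand above can be scripted for larger clusters. The following is a minimal sketch that only parses the bracketed list printed by zkCli's ls command; the helper names parse_zk_ls and missing_tservers are hypothetical, and the expected address list is taken from the three IPs mentioned in this thread.

```python
def parse_zk_ls(output):
    """Parse the bracketed child list printed by zkCli's `ls` into a set."""
    inner = output.strip().strip("[]")
    return {item.strip() for item in inner.split(",") if item.strip()}

def missing_tservers(expected, zk_ls_output):
    """Return expected tserver addresses that have no znode under .../tservers."""
    return sorted(set(expected) - parse_zk_ls(zk_ls_output))

# The three tablet servers in this cluster, per the x.x.x.91,92,93 pattern.
expected = [f"10.35.56.{n}:9997" for n in (91, 92, 93)]

# Output of `ls /accumulo/<instance id>/tservers` as pasted above.
listing = "[10.35.56.92:9997, 10.35.56.93:9997]"

print(missing_tservers(expected, listing))  # ['10.35.56.91:9997']
```

A missing entry here only says that the server currently holds no lock node; it does not explain why the lock could not be acquired.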

 

	 

	The recovery portion of the Admin guide says that recovery is performed by asking the loggers
to copy their write-ahead logs into HDFS.  The logs are copied, sorted and then tablets can
find missing updates.  Once complete the tablets involved should return to an ‘online’
state.

	 

	I am not sure how to ask the loggers to copy their write-ahead logs into HDFS.  Is this the
same as using the flush shell command?  If so, the flush command needs a table name or a pattern
of tables.  Would I want to perform something like ‘accumulo flush -p .+’ to flush
all of the table data to HDFS?

 

You shouldn't have to do anything manually here. The loggers should be handling this completely
for you as a part of their normal operations. The most likely issue you may run into if you're
missing WALs is that your logger process doesn't have enough memory to perform that copy/sort/etc.,
but this is easily verified by checking the logger*.out file for an OOME.


            I don’t see any OutOfMemory exceptions in my logs.  The Xmx on my tserver is
set to 384m while tserver.memory.maps.max is set to 256m.  The admin docs mention keeping
tserver.memory.maps.max at ~75% of the Xmx setting.  I guess I’m around 66%?  There are some other
custom memory settings: tserver.cache.data.size is 15m, tserver.cache.index.size is 40m, logger.sort.buffer.size
is 50m, and tserver.walog.max.size is 256m.  If all of those values combined were ‘maxed
out’, wouldn’t that be well above the 384m of Xmx?
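
The arithmetic behind this question can be written out as a quick sketch. The values are the ones quoted above; the grouping into tserver_side is an assumption for illustration, not something confirmed in this thread. As far as I understand 1.4, logger.sort.buffer.size applies to the separate logger process and tserver.walog.max.size limits the size of a write-ahead log file rather than a memory buffer, and the in-memory map may sit outside the Java heap when native maps are enabled.

```python
XMX_MB = 384  # tserver -Xmx as stated above

settings_mb = {
    "tserver.memory.maps.max": 256,
    "tserver.cache.data.size": 15,
    "tserver.cache.index.size": 40,
    "logger.sort.buffer.size": 50,
    "tserver.walog.max.size": 256,
}

# Ratio the admin docs refer to (suggested to be around 75%).
ratio = settings_mb["tserver.memory.maps.max"] / XMX_MB

# Naive sum of every setting listed in the message.
total_all = sum(settings_mb.values())

# Assumed grouping: only the settings that plausibly live in the tserver JVM.
tserver_side = sum(
    settings_mb[key]
    for key in (
        "tserver.memory.maps.max",
        "tserver.cache.data.size",
        "tserver.cache.index.size",
    )
)

print(f"maps.max / Xmx: {ratio:.1%}")               # 66.7%
print(f"sum of all five settings: {total_all}m")    # 617m
print(f"assumed tserver-side sum: {tserver_side}m") # 311m
```

Under that assumption the tserver-side total of 311m stays below the 384m heap, while the naive total of 617m does not, so the answer depends on which settings actually share the tserver heap.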

	 

	Another concern is that the Tablet Server process was no longer running on the server.  I
logged into that server and ran “start-here.sh”.  The tablet server is now running, but
it is still reported as ‘dead’ to the monitor. 

 

Can you determine from the monitor if that tablet server is actually hosting tablets? 1.4.3
had a couple of bugs around the master not updating its internal state for nodes in the failed
state. Check the Tablet Server page and see if there's an entry in the table of servers.

 

            I can confirm that the tablet server was actually hosting tablets when it was
up and running.  The three tservers seemed to be well balanced, with each tserver hosting between
50 and 60 tablets (currently 163 tablets total).

	 

	Thanks in advance,

	Charles

 
