accumulo-user mailing list archives

From Ray Pfaff <ray.pf...@apx-labs.com>
Subject Communication issue between zookeeper and accumulo
Date Tue, 06 Aug 2013 14:38:23 GMT

I'm running Accumulo 1.4.3 and ZooKeeper 3.3.5, and I seem to have occasional communication
errors between the tablet servers and ZooKeeper.  Sometimes when I restart a tablet server,
I get the following error in my log:

INFO : Waiting for tablet server lock

(repeats numerous times)
INFO : Too many retries, exiting.

At this point the tserver process is still running, but the master registers it as dead.
I have to manually terminate the tserver and then restart it.  Usually by the second or third
try, I no longer get the "exiting" error and the server begins to do work.  I'm running
4 tservers per machine on machines dedicated to tablet servers, so restarting them this way
is a pretty "manual" process.

I've looked at the code, and the process is calling ZooLock.tryLock and failing.  It then
sleeps and tries again, ultimately giving up on the lock after 60 attempts.  I also
note that Jira-954 looks almost exactly the same as this error, if not identical.  However,
it's listed as having been fixed in 1.4.3.
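
For anyone following along, the retry behavior described above can be sketched roughly
as follows.  This is not the Accumulo source: the class LockRetrySketch, the method
acquireWithRetries, and the BooleanSupplier standing in for the lock attempt are all
hypothetical names chosen for illustration, and the sleep interval is an assumption.
Only the two log messages and the limit of 60 attempts come from what I observed.

```java
import java.util.function.BooleanSupplier;

public class LockRetrySketch {

    /**
     * Calls the supplied lock attempt up to maxAttempts times, sleeping
     * between failed attempts. Returns the 1-based number of the attempt
     * that succeeded, or -1 if every attempt failed or the thread was
     * interrupted while sleeping.
     *
     * The BooleanSupplier is a hypothetical stand-in for a real lock call.
     */
    public static int acquireWithRetries(BooleanSupplier tryLock,
                                         int maxAttempts,
                                         long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryLock.getAsBoolean()) {
                return attempt; // lock acquired
            }
            System.out.println("INFO : Waiting for tablet server lock");
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return -1;
            }
        }
        System.out.println("INFO : Too many retries, exiting.");
        return -1;
    }

    public static void main(String[] args) {
        // A lock attempt that never succeeds, with a short sleep (assumed value).
        int result = acquireWithRetries(() -> false, 60, 1L);
        System.out.println("result = " + result);
    }
}
```

In a sketch like this, the caller has to act on the -1 result and actually shut the
process down.  If it only logs the failure, the process stays alive without a lock,
which would match the "still running but dead to the master" state I am seeing.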

Is there some step in configuring either ZooKeeper or the tservers that I've missed that
will get rid of this?
