From Frans Lawaetz <>
Subject ZooKeeper ConnectionLoss in Accumulo 1.4.5
Date Mon, 14 Apr 2014 17:00:31 GMT

I'm running a five-node Accumulo 1.4.5 cluster with ZooKeeper 3.4.6
distributed across the same systems.

We've seen a couple of tserver failures that look similar to ACCUMULO-1572
(which was patched in 1.4.5).  What is perhaps unique in this circumstance
is that the user reported the failures occurring immediately upon entering
a command in the Accumulo shell.  The commands were a routine scan and
delete.  The error is attached but boils down to:

2014-04-09 21:48:49,552 [zookeeper.ZooLock] WARN : lost connection to zookeeper
2014-04-09 21:48:49,552 [zookeeper.ZooCache] WARN : Zookeeper error, will retry
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /acc....
2014-04-09 21:48:49,554 [zookeeper.DistributedWorkQueue] INFO : Got unexpected zookeeper event: None
[ repeat the above a few times and then finally ]
2014-04-09 21:48:51,866 [tabletserver.TabletServer] FATAL: Lost ability to monitor tablet server lock, exiting.

The ZooKeeper arrangement here is non-optimal in that the servers share
the same virtualized disk as the Hadoop and Accumulo processes.  The system
was performing bulk ingest at the time, so contention was very likely an
issue.

ZooKeeper did report, at essentially the same millisecond:

2014-04-09 21:48:49,551 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
Read timed out
[ followed by a number of ]
2014-04-09 21:48:49,919 [myid:1] - WARN  [NIOServerCxn.Factory:] - Exception causing close of session 0x0 due to ZooKeeperServer not running

It's important to note, however, that:

- The ZooKeeper errors above occur many other times in the logs, and on
those occasions the Accumulo cluster has been fine.
- The ZooKeeper ensemble recovered without intervention.
- The WARN to FATAL time for Accumulo was just two seconds, whereas I was
under the impression the process would only give up after two retry
attempts lasting 30s each.
- Only the tserver on the system where the user was running the Accumulo
shell failed, and only (we believe) upon issuance of a command.
- accumulo-site.xml on all nodes is configured with three ZooKeeper servers,
so the system should be attempting to fail over.
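
For anyone who wants to check the WARN-to-FATAL interval, here is a minimal
sketch in Python that computes it from the two tserver timestamps quoted in
the log excerpt above.  The function and constant names are made up for
illustration; only the two timestamps come from the logs.

```python
from datetime import datetime

# Timestamps copied from the tserver log excerpt: the first ZooLock WARN
# and the final TabletServer FATAL.
FIRST_WARN = "2014-04-09 21:48:49,552"
FATAL = "2014-04-09 21:48:51,866"

# log4j-style timestamp: date, time, comma, milliseconds.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def elapsed_seconds(start, end):
    """Return the number of seconds between two log timestamps."""
    start_dt = datetime.strptime(start, LOG_TS_FORMAT)
    end_dt = datetime.strptime(end, LOG_TS_FORMAT)
    return (end_dt - start_dt).total_seconds()


if __name__ == "__main__":
    print(f"WARN to FATAL: {elapsed_seconds(FIRST_WARN, FATAL):.3f}s")
```

That works out to a little over two seconds, far short of the two 30s retry
attempts I expected.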


