accumulo-user mailing list archives

From Sean Busbey <bus...@cloudera.com>
Subject Re: ZooKeeper ConnectionLoss in Accumulo 1.4.5
Date Mon, 14 Apr 2014 17:11:15 GMT
The log shows the tserver retrying through the ZK connection issues, but it
looks like it independently lost its lock.

The very start of the log claims you have vm.swappiness set to 60. Can you
zero this out and see if the issue still happens?
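
Something like this on each host (standard Linux sysctl; exact persistence
mechanism varies by distro, so adjust as needed):

```shell
# Check the current setting (60 is the common distro default).
cat /proc/sys/vm/swappiness

# Zero it for the running kernel (needs root):
#   sysctl -w vm.swappiness=0
# and persist it across reboots:
#   echo 'vm.swappiness = 0' >> /etc/sysctl.conf
```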

Also, check whether that host starts hitting swap once the user runs a shell
command on it. If the JVMs start swapping, the resulting pauses will cause
services to lose their ZK locks.
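
A couple of low-impact ways to spot that on a Linux host (vmstat's si/so
columns are the usual tell; the pid in the last command is whatever your
tserver's pid happens to be):

```shell
# Point-in-time swap usage, readable on any Linux host:
grep -E 'SwapTotal|SwapFree' /proc/meminfo

# Live paging activity while the shell command runs; non-zero si/so
# columns mean the host is actively swapping:
#   vmstat 1 5

# Swap charged to a specific process, e.g. the tserver JVM:
#   grep VmSwap /proc/<pid>/status
```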




On Mon, Apr 14, 2014 at 10:00 AM, Frans Lawaetz <flawaetz@gmail.com> wrote:

>
> Hi-
>
> I'm running a five-node Accumulo 1.4.5 cluster with zookeeper 3.4.6
> distributed across the same systems.
>
> We've seen a couple tserver failures in a manifestation that feels similar
> to ACCUMULO-1572 (which was patched in 1.4.5).  What is perhaps unique in
> this circumstance is that the user reported these failures occurring
> immediately upon entering a command in the accumulo shell.  The commands
> were a routine scan and delete.  The error is attached but boils down to:
>
> 2014-04-09 21:48:49,552 [zookeeper.ZooLock] WARN : lost connection to
>> zookeeper
>> 2014-04-09 21:48:49,552 [zookeeper.ZooCache] WARN : Zookeeper error, will
>> retry
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss for /acc....
>> 2014-04-09 21:48:49,554 [zookeeper.DistributedWorkQueue] INFO : Got
>> unexpected zookeeper event: None
>> [ repeat the above a few times and then finally ]
>> 2014-04-09 21:48:51,866 [tabletserver.TabletServer] FATAL: Lost ability
>> to monitor tablet server lock, exiting.
>
>
> The zookeeper arrangement here is non-optimal in that they're working on
> the same virtualized disk as the hadoop and accumulo processes.  The system
> was performing bulk ingest at the time so contention was very likely an
> issue.
>
> Zookeeper did report, at essentially the same millisecond:
>
> 2014-04-09 21:48:49,551 [myid:1] - WARN
>>  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
>> following the leader
>> java.net.SocketTimeoutException: Read timed out
>> [ followed by a number of ]
>> 2014-04-09 21:48:49,919 [myid:1] - WARN  [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
>> session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> It's important to note however that:
>
> - The ZooKeeper errors above occur many other times in the logs and the
> accumulo cluster has been ok.
> - The ZooKeeper ensemble recovered without intervention.
> - The WARN-to-FATAL interval for Accumulo was just two seconds, whereas I
> was under the impression the process would only give up after two retry
> attempts of 30s each.
> - Only the tserver on the system where the user was running the accumulo
> shell failed, and only (we believe) upon issuance of a command.
> - accumulo-site.xml on all nodes is configured with three zookeepers so
> the system should be attempting to fail over.
>
> Thanks,
> Frans
>
>


-- 
Sean
