accumulo-user mailing list archives

From Frans Lawaetz <flawa...@gmail.com>
Subject Re: ZooKeeper ConnectionLoss in Accumulo 1.4.5
Date Mon, 14 Apr 2014 18:59:08 GMT
The system swappiness warning is a bit of a red herring in that the systems
aren't configured with any swap space.  They all have 64GB of RAM, of which
~50GB is currently sitting in fs cache.  The load on these systems was very
high during ingest, so I'm sure there was IO latency even without swap use.
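
For reference, the no-swap setup and the swappiness value the log flags can
be confirmed with something like the following (the 60 is the value the log
reported on these hosts; your output may differ):

    $ free -g | grep -i swap       # Swap: 0 total / 0 used on these nodes
    $ cat /proc/sys/vm/swappiness
    60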

In reviewing the log I see lots of promises about "will retry" (without the
usual 250 or 500ms qualifier) for the various connections to ZK that are
lost, followed by a fatal event once the tablet server lock is lost.  It's
not clear to me, though, that Accumulo actually tries to reconnect or fail
over.

Given that the other tservers stayed up, as well as the master, all of which
were configured to use the same ZK members, it would appear that there were
functional ZK services available and that the failing tserver bailed
prematurely.

Beyond the ZK connection timeout parameter (30s by default), are there any
other settings that can make Accumulo more tolerant of ZK glitches?
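
The only related knob I've found so far is that session timeout itself, e.g.
in accumulo-site.xml (shown just to illustrate the property name; I haven't
tested whether a larger value actually helps, and I believe ZooKeeper's own
maxSessionTimeout on the servers would also have to allow it):

    <property>
      <name>instance.zookeeper.timeout</name>
      <value>60s</value>
    </property>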



On Mon, Apr 14, 2014 at 1:11 PM, Sean Busbey <busbey@cloudera.com> wrote:

> The log looks like it is retrying the ZK connection issues, but it
> independently lost the lock.
>
> The very start of the log claims you have vm.swappiness set to 60. Can you
> zero this out and see if the issue still happens?
>
> Also, check to see if you're hitting swap once the user is running a shell
> command on that host. If you start swapping, the pauses will cause services
> to lose their ZK locks.
>
>
>
>
> On Mon, Apr 14, 2014 at 10:00 AM, Frans Lawaetz <flawaetz@gmail.com> wrote:
>
>>
>> Hi-
>>
>> I'm running a five-node Accumulo 1.4.5 cluster with ZooKeeper 3.4.6
>> distributed across the same systems.
>>
>> We've seen a couple tserver failures in a manifestation that feels
>> similar to ACCUMULO-1572 (which was patched in 1.4.5).  What is perhaps
>> unique in this circumstance is that the user reported these failures
>> occurring immediately upon entering a command in the accumulo shell.  The
>> commands were a routine scan and delete.  The error is attached but boils
>> down to:
>>
>> 2014-04-09 21:48:49,552 [zookeeper.ZooLock] WARN : lost connection to
>>> zookeeper
>>> 2014-04-09 21:48:49,552 [zookeeper.ZooCache] WARN : Zookeeper error,
>>> will retry
>>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>>> KeeperErrorCode = ConnectionLoss for /acc....
>>> 2014-04-09 21:48:49,554 [zookeeper.DistributedWorkQueue] INFO : Got
>>> unexpected zookeeper event: None
>>> [ repeat the above a few times and then finally ]
>>> 2014-04-09 21:48:51,866 [tabletserver.TabletServer] FATAL: Lost ability
>>> to monitor tablet server lock, exiting.
>>
>>
>> The ZooKeeper arrangement here is non-optimal in that it shares the same
>> virtualized disk as the Hadoop and Accumulo processes.  The system was
>> performing bulk ingest at the time, so contention was very likely an
>> issue.
>>
>> ZooKeeper did report, at essentially the same millisecond:
>>
>> 2014-04-09 21:48:49,551 [myid:1] - WARN
>>>  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
>>> following the leader
>>> java.net.SocketTimeoutException: Read timed out
>>> [ followed by a number of ]
>>> 2014-04-09 21:48:49,919 [myid:1] - WARN  [NIOServerCxn.Factory:
>>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of
>>> session 0x0 due to java.io.IOException: ZooKeeperServer not running
>>
>>
>> It's important to note, however, that:
>>
>> - The ZooKeeper errors above occur many other times in the logs and the
>> Accumulo cluster has been OK.
>> - The ZooKeeper ensemble recovered without intervention.
>> - The WARN-to-FATAL time for Accumulo was just two seconds, whereas I was
>> under the impression the process would only give up after two retry
>> attempts of 30s each.
>> - Only the tserver on the system where the user was running the accumulo
>> shell failed, and only (we believe) upon issuance of a command.
>> - accumulo-site.xml on all nodes is configured with three ZooKeepers, so
>> the system should be attempting to fail over.
>>
>> Thanks,
>> Frans
>>
>>
>
>
> --
> Sean
>



-- 
Ph: 617.306.8083
