accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Zookeeper ConnectionLossException
Date Fri, 30 Mar 2012 14:31:53 GMT
If a tablet server fails to communicate with zookeeper for long enough, its
zookeeper session expires. It then loses its lock, and with it its exclusive
access to the tablets it was serving.  When that happens, it kills itself.

Here are the usual reasons for a failure to communicate in a timely fashion:
 - the tablet server has been swapped out
 - the tablet server needs to do a stop-the-world garbage collection
 - zookeeper has been swapped out

The Linux kernel aggressively swaps out processes in order to expand the
disk cache.  How strongly it tends to do this is controlled by the
swappiness kernel setting.  Set this to zero:

 # echo 0 >/proc/sys/vm/swappiness
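
To confirm the setting took effect on each node, the value can be read back
from the same proc file.  The following is only a minimal sketch: the
function names read_swappiness and swappiness_warning are made up for this
illustration, and the path is the one used in the command above.

```python
from pathlib import Path

# Same proc file that the echo command above writes to.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path=SWAPPINESS_PATH):
    """Return the current swappiness value as an int."""
    return int(Path(path).read_text().strip())


def swappiness_warning(value):
    """Return a warning string if swappiness is non-zero, else None."""
    if value == 0:
        return None
    return f"vm.swappiness is {value}; this thread recommends 0"


if __name__ == "__main__":
    print(swappiness_warning(read_swappiness()) or "vm.swappiness is 0")
```

Note that a value written with echo does not survive a reboot; it is usually
made persistent with a "vm.swappiness = 0" line in /etc/sysctl.conf.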

Ensure that you have ample memory.

You can see the status of available memory by looking for the "gc" lines in
the tablet server debug log:

22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc ParNew=0.33(+0.01) secs ConcurrentMarkSweep=0.01(+0.01) secs freemem=108,455,440(+43,075,312) totalmem=132,055,040

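In particular, watch the delta "(+0.01)" numbers, which show how much time
was spent in each collector since the previous "gc" line.  If one of these
exceeds the zookeeper timeout (30 seconds by default), then you will most
likely lose the server.  You will notice this happening when freemem shrinks
toward zero, that is, when the memory in use approaches totalmem.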
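
Scanning the debug log for large deltas can be automated.  This is a rough
sketch that assumes the "gc" line format shown above, with the whole log
entry on one line.  The names parse_gc_line and slow_collections are
invented for this example, and the 30 second threshold is simply the default
mentioned above.

```python
import re

ZK_TIMEOUT_SECS = 30.0  # default quoted above; change to match your setting

# Matches e.g. "ParNew=0.33(+0.01) secs" and captures the name and the delta.
GC_TIME = re.compile(r"(\w+)=\d+(?:\.\d+)?\(\+(\d+(?:\.\d+)?)\) secs")
FREEMEM = re.compile(r"freemem=([\d,]+)")
TOTALMEM = re.compile(r"totalmem=([\d,]+)")


def parse_gc_line(line):
    """Parse one tablet server "gc" debug line, or return None if it is not one."""
    free = FREEMEM.search(line)
    total = TOTALMEM.search(line)
    if " gc " not in line or not free or not total:
        return None
    return {
        # Seconds spent in each collector since the previous gc line.
        "deltas": {name: float(delta) for name, delta in GC_TIME.findall(line)},
        "freemem": int(free.group(1).replace(",", "")),
        "totalmem": int(total.group(1).replace(",", "")),
    }


def slow_collections(line, threshold_secs=ZK_TIMEOUT_SECS):
    """Return (collector, delta) pairs whose delta exceeds the threshold."""
    stats = parse_gc_line(line)
    if stats is None:
        return []
    return [(n, d) for n, d in stats["deltas"].items() if d > threshold_secs]
```

In practice a threshold well below the timeout is probably more useful as an
early warning, for example slow_collections(line, threshold_secs=10).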

I don't have much experience running Accumulo on VMs, but I have seen VMs
behave strangely with respect to timekeeping.  That is another possible
culprit.

-Eric

On Fri, Mar 30, 2012 at 9:00 AM, Jared Winick <jaredwinick@gmail.com> wrote:

> I am running 1.4.0 RC6 in a single server configuration on EC2. After
> over 1 day of successful MapReduce ingest, I see this as the first of
> many errors/warnings in the monitor's recent logs.
>
> "Problem getting real goal state:
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for
> /accumulo/d36894a2-e760-4273-9c9f-dfa64ed8f4bc/masters/goal_state"
>
> This message is followed by attempts to reconnect to Zookeeper, and then
> finally
>
> "Lost tablet server lock (reason = SESSION_EXPIRED), exiting."
>
> Zookeeper still appears to be running at this time. Running
> everything on a single VM is obviously not the ideal configuration.
> Does anyone know what the root cause of my problem is and how I can
> best avoid it happening again? Also, if I just stop and restart
> Accumulo, should everything be OK again now that Zookeeper is
> available and responsive?
>
> Thanks a lot.
>
> Jared Winick
>
