accumulo-user mailing list archives

From Jared winick <jaredwin...@gmail.com>
Subject Re: Zookeeper ConnectionLossException
Date Fri, 30 Mar 2012 17:03:19 GMT
Thanks for the detailed response. I will take steps to ensure I have
enough memory and that things aren't getting swapped out. I may move
Zookeeper to its own micro instance to make sure it isn't impacted by
all the other Accumulo and Hadoop processes.

On Fri, Mar 30, 2012 at 8:31 AM, Eric Newton <eric.newton@gmail.com> wrote:
> If a client fails to communicate with zookeeper for long enough, it loses
> its lock, and loses its exclusive access to the tablets it was serving.
> When that happens, it kills itself.
>
> Here are the reasons for a failure to communicate in a timely fashion:
>  - tablet server has swapped out
>  - tablet server needs to do a stop-the-world garbage collection
>  - zookeeper swaps out
>
> The Linux kernel aggressively swaps out processes in order to expand the
> disk cache.  How strongly it tends to do this is controlled by the
> swappiness kernel setting.  Set this to zero:
>
>  # echo 0 >/proc/sys/vm/swappiness
>
> Ensure that you have ample memory.
>
> You can see the status of available memory by looking for the "gc" lines in
> the tablet server debug log:
>
> 22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc ParNew=0.33(+0.01) secs ConcurrentMarkSweep=0.01(+0.01) secs freemem=108,455,440(+43,075,312) totalmem=132,055,040
>
> In particular, watch the delta "(+0.01)" numbers.  If a delta exceeds the
> zookeeper timeout (30 seconds by default), then you will most likely lose
> the server.  You will notice this happening when freemem approaches zero,
> that is, when nearly all of totalmem is in use.
>
> I don't have much experience running Accumulo on VMs, but I have seen VMs
> have strange behavior with respect to timekeeping.  That might be another
> possible culprit.
>
> -Eric
>
> On Fri, Mar 30, 2012 at 9:00 AM, Jared winick <jaredwinick@gmail.com> wrote:
>>
>> I am running 1.4.0 RC6 in a single-server configuration on EC2. After
>> more than a day of successful MapReduce ingest, I see this as the first of
>> many errors/warnings in the monitor's recent logs.
>>
>> "Problem getting real goal state:
>> org.apache.zookeeper.KeeperException$ConnectionLossException:
>> KeeperErrorCode = ConnectionLoss for
>> /accumulo/d36894a2-e760-4273-9c9f-dfa64ed8f4bc/masters/goal_state"
>>
>> This message is followed by attempts to reconnect to Zookeeper, and then
>> finally
>>
>> "Lost tablet server lock (reason = SESSION_EXPIRED), exiting."
>>
>> Zookeeper still appears to be running at this time. Running everything
>> on a single VM is certainly not the ideal configuration. Does anyone
>> know what the root cause of my problem is and how I can best avoid it
>> happening again? Also, if Zookeeper is now available and responsive,
>> can I just stop and restart Accumulo and expect everything to be OK
>> again?
>>
>> Thanks a lot.
>>
>> Jared Winick
>
>
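For anyone wanting to check the swapping advice above, here is a minimal
sketch in Python that reports the current swappiness value and how much swap
is in use. It assumes a Linux host with the standard /proc/sys/vm/swappiness
and /proc/meminfo files; the function names (parse_meminfo, swap_used_kb,
report) are made up for this example and are not part of Accumulo or
Zookeeper.

```python
from pathlib import Path


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Values are in kB for the lines that carry a kB unit.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def report(proc_root="/proc"):
    """Print swappiness, free memory, and swap in use for this host."""
    root = Path(proc_root)
    swappiness = int((root / "sys/vm/swappiness").read_text().strip())
    info = parse_meminfo((root / "meminfo").read_text())
    print(f"vm.swappiness = {swappiness}")
    print(f"MemFree = {info.get('MemFree', 0)} kB")
    print(f"swap in use = {swap_used_kb(info)} kB")


if __name__ == "__main__":
    # Only meaningful on Linux; skip quietly elsewhere.
    proc = Path("/proc")
    if (proc / "sys/vm/swappiness").exists() and (proc / "meminfo").exists():
        report()
```

A nonzero "swap in use" on the tablet server or Zookeeper host suggests
something has been swapped out. Note that a value written with echo to
/proc/sys/vm/swappiness does not survive a reboot; adding a line such as
"vm.swappiness = 0" to /etc/sysctl.conf is the usual way to make it stick.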

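The "gc" lines Eric describes can also be watched with a small script. The
following is a rough sketch in Python that pulls the per-collector deltas and
the freemem/totalmem figures out of one such line and flags deltas that are
getting close to the Zookeeper timeout. The line format is inferred only from
the single example quoted in this thread, so other Accumulo versions may log
it differently. The names parse_gc_line and is_risky and the warn_fraction
threshold are assumptions for illustration; the 30-second default is the
figure given above.

```python
import re

# Matches e.g. "ParNew=0.33(+0.01) secs": collector name, total, delta.
GC_PAUSE = re.compile(r"(\w+)=([\d.]+)\(\+([\d.]+)\) secs")
# Matches e.g. "freemem=108,455,440" and "totalmem=132,055,040".
MEM = re.compile(r"(freemem|totalmem)=([\d,]+)")


def parse_gc_line(line):
    """Return ({collector: delta_secs}, {"freemem": bytes, "totalmem": bytes})."""
    pauses = {name: float(delta) for name, _total, delta in GC_PAUSE.findall(line)}
    mem = {name: int(value.replace(",", "")) for name, value in MEM.findall(line)}
    return pauses, mem


def is_risky(line, timeout_secs=30.0, warn_fraction=0.5):
    """True if any collector's delta reaches warn_fraction of the timeout.

    warn_fraction is an arbitrary early-warning margin chosen for this sketch.
    """
    pauses, _mem = parse_gc_line(line)
    return any(delta >= warn_fraction * timeout_secs for delta in pauses.values())


if __name__ == "__main__":
    sample = (
        "22 16:28:54,199 [tabletserver.TabletServer] DEBUG: gc "
        "ParNew=0.33(+0.01) secs ConcurrentMarkSweep=0.01(+0.01) secs "
        "freemem=108,455,440(+43,075,312) totalmem=132,055,040"
    )
    pauses, mem = parse_gc_line(sample)
    print(pauses, mem, is_risky(sample))
```

Feeding each "gc" line of the tablet server debug log through is_risky gives
an early warning before a pause actually reaches the session timeout.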