accumulo-user mailing list archives

From Sean Busbey <busbey...@clouderagovt.com>
Subject Re: Tserver kills themselves from lost Zookeeper locks
Date Mon, 18 Nov 2013 00:20:06 GMT
You may also want to try decreasing your memory allocations. For example,
if you are using native maps, try giving the tablet server just 1-2GB of
heap in accumulo-env.sh.
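
For what it's worth, that knob is the tablet server's JVM options in
accumulo-env.sh. A minimal sketch of the relevant line (the exact variable
name and form vary by Accumulo version, and 1g is just the illustration
above, not a tuned value):

  test -z "$ACCUMULO_TSERVER_OPTS" && export ACCUMULO_TSERVER_OPTS="${POLICY} -Xmx1g -Xms1g"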

Can you find the following settings:

accumulo-env.sh:

* Tablet Server Java max heap

accumulo-site.xml:
* tserver.memory.maps.max
* tserver.cache.data.size
* tserver.cache.index.size
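
For reference, those accumulo-site.xml settings are plain <property>
entries. A sketch with illustrative values only (not defaults, not
recommendations):

  <property>
    <name>tserver.memory.maps.max</name>
    <value>1G</value>
  </property>
  <property>
    <name>tserver.cache.data.size</name>
    <value>128M</value>
  </property>
  <property>
    <name>tserver.cache.index.size</name>
    <value>128M</value>
  </property>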

In the past, the two most common misconfigurations I've seen are:

1) Using native maps (the default) while not taking into account the memory
required by them (tserver.memory.maps.max).

This usually comes up when the host starts swapping under pressure.
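
A quick way to check for that on Linux (the pgrep pattern here is just a
guess at how the tserver process shows up on your boxes):

  # non-zero VmSwap means the tablet server's pages have been swapped out
  grep VmSwap /proc/$(pgrep -f tserver | head -1)/status
  # system-wide view: the si/so columns should stay at or near zero
  vmstat 5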

2) Giving tablet servers lots of memory in accumulo-env.sh, but not
increasing the cache sizes.

This usually happens when people leave their accumulo-site.xml at the
defaults but give the tablet server 12-16GB of memory in accumulo-env.sh.
The JVM just burns the extra space, and then when GC finally happens it
causes timeouts.
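
If you do hand the tablet server a big heap, grow the caches with it. A
sketch of what that pairing might look like, with purely illustrative
numbers that you would still need to tune for your workload:

  accumulo-env.sh:
    test -z "$ACCUMULO_TSERVER_OPTS" && export ACCUMULO_TSERVER_OPTS="${POLICY} -Xmx12g -Xms12g"

  accumulo-site.xml:
    <property>
      <name>tserver.cache.data.size</name>
      <value>2G</value>
    </property>
    <property>
      <name>tserver.cache.index.size</name>
      <value>1G</value>
    </property>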

Can you get us full logs? Preferably for both the Accumulo bits and ZooKeeper.



On Sat, Nov 16, 2013 at 8:42 PM, John Vines <vines@apache.org> wrote:

> As Eric suggested before, make sure things aren't being pushed into swap.
> An 11-second delay is definitely indicative of it. Check that both the
> tserver process itself and your VM as a whole are not swapping. Mind you,
> things may swap not only as a symptom of full physical memory, but also
> because the OS is trying to be "helpful".
>
> Sent from my phone, please pardon the typos and brevity.
> On Nov 16, 2013 9:07 PM, "buttercream" <buttercreamanonymous@gmail.com>
> wrote:
>
>> I did not omit any log messages.
>>
>> The interesting thing is that the query load really isn't that bad (or at
>> least I perceive it to not be bad). I'm just doing direct lookups on
>> individual rows based on rowID. At most that would be about 2k at a time.
>> I would need to dig through some other logs to see whether there was
>> actually a batch scan happening at that time. I usually don't realize
>> there is a problem until the system stops responding and I check the
>> master log and see that it shows no tablet servers running.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-accumulo.1065345.n5.nabble.com/Tserver-kills-themselves-from-lost-Zookeeper-locks-tp6125p6485.html
>> Sent from the Users mailing list archive at Nabble.com.
>>
>
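
On John's point above about the OS trying to be "helpful": on Linux that
behavior is governed by the vm.swappiness sysctl, which is commonly turned
way down on Accumulo/Hadoop nodes. A sketch, assuming you have checked what
is appropriate for your distro and kernel:

  # show the current value
  sysctl vm.swappiness
  # make the kernel far less eager to swap out application pages
  sudo sysctl -w vm.swappiness=1
  # persist across reboots
  echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf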


-- 
Sean
