accumulo-user mailing list archives

From Eric Newton <eric.new...@gmail.com>
Subject Re: Unable to write data, tablet servers lose there locks
Date Thu, 05 Nov 2015 12:12:56 GMT
Comments inline:

On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik <mohit.kaushik@orkash.com>
wrote:

>
> I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6) which was
> working fine before I ran into this issue. Whenever I start writing data
> with a BatchWriter, the tablet servers lose their locks one by one. The
> ZooKeeper logs show socket connections for the servers being repeatedly
> opened and closed, with endless repetitions of the following lines.
>

By far, the most common reason locks are lost is Java GC pauses. In turn,
these pauses are almost always caused by memory pressure across the entire
system. The OS sees a nice big hunk of memory in the tserver and swaps it
out. Over the years we've tuned various settings to prevent this, and other
memory hogging, but if you are pushing the system hard, you may have to
tune your existing memory settings.

The tserver periodically prints some GC stats in the debug log. If you see
a gap of more than 30s between these messages, memory pressure is probably
the problem.
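
As a rough way to look for such gaps, here is a minimal Python sketch. It
assumes each debug log line starts with a log4j-style timestamp like the
ones in the ZooKeeper log quoted below, and that the GC stat lines contain
the text " gc ". Both the marker and the function name are assumptions for
illustration; adjust them to match what your tserver debug log prints.

```python
from datetime import datetime, timedelta

# Assumption: lines start with a timestamp such as "2015-11-05 12:11:23,860".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = 23
# Assumption: GC stat lines contain this marker; change it to fit your log.
GC_MARKER = " gc "


def find_gc_gaps(lines, threshold=timedelta(seconds=30)):
    """Return (previous, current, gap) for consecutive GC lines whose
    timestamps are further apart than the threshold."""
    gaps = []
    previous = None
    for line in lines:
        if GC_MARKER not in line:
            continue
        try:
            current = datetime.strptime(line[:TIMESTAMP_LENGTH], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # line does not start with a parseable timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps
```

Passing an open file handle for the debug log to find_gc_gaps() returns the
list of suspicious gaps; an empty list means no gap exceeded the threshold.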


>
> 2015-11-05 12:11:23,860 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
> connection from /192.168.10.124:47503
> 2015-11-05 12:11:23,861 [myid:3] - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing stat command from /
> 192.168.10.124:47503
> 2015-11-05 12:11:23,869 [myid:3] - INFO
> [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
> 2015-11-05 12:11:23,870 [myid:3] - INFO  [Thread-244:NIOServerCnxn@1007]
> - Closed socket connection for client /192.168.10.124:47503 (no session
> established for client)
>

Yes, this is quite annoying: you get these messages when the monitor grabs
the ZooKeeper status EVERY 5s.  Your monitor is running on 192.168.10.124,
right?

These messages are expected.


> This looks similar to ZOOKEEPER-832, though I am not sure it is the same
> issue. There is one thread discussing socket connections, but it does not
> provide much help in my case:
> http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E
>
> There are no exceptions in tserver logs and tablet servers simply lose
> their locks.
>

Ah, is it possible the JVM is killing itself because GC overhead is
climbing too high? You can check the .out (or .err) file for this error.
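
A minimal Python sketch of that check follows. It assumes the JVM reported
the failure as a java.lang.OutOfMemoryError line (the usual form is
"java.lang.OutOfMemoryError: GC overhead limit exceeded"); the function
name is made up for illustration.

```python
# The JVM reports these failures as java.lang.OutOfMemoryError lines.
OOM_MARKER = "java.lang.OutOfMemoryError"


def find_oom_lines(lines):
    """Return (line_number, text) for every line mentioning an OutOfMemoryError."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if OOM_MARKER in line
    ]
```

Run it over the open .out and .err files for each tserver; any hit means
the JVM ran out of memory rather than merely pausing.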


> I can scan data without any problem or exception. I need to know the
> cause of the problem and a workaround. Would upgrading resolve the issue,
> or does it need some configuration changes?
>

Check all your system processes. I know old versions of the SNMP servers
would leak resources, putting memory pressure on the system after a few
months.  Check that your tserver is approximately the size you need.
If you aren't already doing it, you will want to monitor system memory/swap
usage, and see if it correlates with the lost servers.  ZooKeeper itself is
also subject to GC pauses, so its servers can die from the same cause,
although it's a much smaller process.
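
If you have no monitoring in place yet, a small Python sketch along these
lines can record swap usage on Linux. It assumes /proc/meminfo is available
and reports SwapTotal and SwapFree; the function names are made up for
illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: numeric value}.
    Most fields are reported in kB."""
    values = {}
    for line in text.splitlines():
        name, separator, rest = line.partition(":")
        fields = rest.split()
        if separator and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Return swap in use (kB), computed as SwapTotal minus SwapFree."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)
```

Reading /proc/meminfo every minute or so and logging swap_used_kb() with a
timestamp gives you something to line up against the times the tservers
lost their locks.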



> My current zoo.cfg is as follows.
>
> clientPort=2181
> syncLimit=5
> tickTime=2000
> initLimit=10
> maxClientCnxn=100
>

That's all fine, but you may want to turn on the ZooKeeper clean-up:

http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration


Search for "autopurge".
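
For illustration, the relevant zoo.cfg lines might look like the fragment
below. The values are only an example, not a recommendation for your
cluster: snapRetainCount is the number of snapshots to keep, and
purgeInterval is in hours (0 leaves purging disabled).

```
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
```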


>
> I can upload full logs if anyone needs. Please do let me know if you need
> any other info.
>

How much memory is allocated to the various processes? Do you have swap
turned on? Do you see the delay in the debug GC messages?

You could try turning off swap, so the OS will kill your process instead of
killing itself. :-)

-Eric
