accumulo-user mailing list archives

From "mohit.kaushik" <mohit.kaus...@orkash.com>
Subject Re: Unable to write data, tablet servers lose their locks
Date Fri, 06 Nov 2015 11:24:31 GMT
  Eric/Josef,

The issue is resolved now. You were right: I think the OS swapped out the 
tservers because the GC process was not working properly. It had a port 
conflict with another service after some recent changes I made, and I have 
also increased the GC heap memory limit. And yes, my monitor was running 
on 192.168.10.124 :) .

Thanks

On 11/05/2015 07:46 PM, Josef Roehrl - PHEMI wrote:
> Everything else notwithstanding, if you see any swap space being 
> used, you need to adjust things to prevent swapping first.
>
> My 2 cents.
>
> On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton <eric.newton@gmail.com> wrote:
>
>     Comments inline:
>
>     On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik
>     <mohit.kaushik@orkash.com> wrote:
>
>
>         I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6)
>         that was working fine before I ran into this issue. Whenever
>         I start writing data with a BatchWriter, the tablet servers
>         lose their locks one by one. In the ZooKeeper logs I found it
>         repeatedly accepting and closing socket connections for the
>         servers, with endless repetitions of the following lines.
>
>
>     By far, the most common reason locks are lost is Java GC pauses.
>     In turn, these pauses are almost always due to memory pressure
>     across the whole system: the OS sees a nice big chunk of memory
>     in the tserver and swaps it out. Over the years we've tuned
>     various settings to prevent this and other memory hogging, but if
>     you are pushing the system hard, you may have to tune your
>     existing memory settings.
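A quick way to check Eric's theory is to look at how much of the tserver JVM has actually been swapped out. This is a minimal sketch assuming Linux (`/proc`) and a process whose command line matches "tserver"; adapt the `pgrep` pattern to however your tablet server process is named:

```shell
# Report swapped-out pages (VmSwap) for the first matching tserver JVM.
# Always prints something, so it is safe to run on a node with no tserver.
pid=$(pgrep -f tserver | head -1)
if [ -n "$pid" ]; then
  grep VmSwap "/proc/$pid/status" 2>/dev/null || echo "could not read status for pid $pid"
else
  echo "no tserver process found"
fi
```

A non-zero VmSwap on a tserver is a strong hint that the lock losses line up with OS paging.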
>
>     The tserver occasionally prints some gc stats in the debug log. If
>     you see a >30s pause between these messages, memory pressure is
>     probably the problem.
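One way to spot such a pause is to compute the gap between consecutive GC debug messages. The log format below is a made-up fixture (the real lines live in your tserver debug log); the awk sketch only assumes the standard `HH:MM:SS,millis` timestamp in the second field:

```shell
# Fixture standing in for a real tserver debug log with two "gc" lines 42s apart.
cat > /tmp/gc_sample.log <<'EOF'
2015-11-05 12:10:01,000 [tserver] DEBUG: gc ParNew=0.05 ...
2015-11-05 12:10:43,000 [tserver] DEBUG: gc ParNew=0.07 ...
EOF
# Print the gap in seconds between consecutive gc debug lines; >30s gaps
# suggest the memory pressure Eric describes.
awk '/DEBUG.*gc/ {
  split($2, t, /[:,]/)
  secs = t[1]*3600 + t[2]*60 + t[3]
  if (prev) printf "gap: %ds\n", secs - prev
  prev = secs
}' /tmp/gc_sample.log
# prints: gap: 42s
```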
>
>
>         2015-11-05 12:11:23,860 [myid:3] - INFO
>         [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197]
>         - Accepted socket connection from /192.168.10.124:47503
>         2015-11-05 12:11:23,861 [myid:3] - INFO
>         [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827]
>         - Processing stat command from /192.168.10.124:47503
>         2015-11-05 12:11:23,869 [myid:3] - INFO
>         [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
>         2015-11-05 12:11:23,870 [myid:3] - INFO
>         [Thread-244:NIOServerCnxn@1007] - Closed socket connection for
>         client /192.168.10.124:47503 (no session established for client)
>
>
>     Yes, this is quite annoying: you get these messages when the
>     monitor grabs the ZooKeeper status every 5 seconds. Your monitor
>     is running on 192.168.10.124, right?
>
>     These messages are expected.
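The monitor's poll is just ZooKeeper's four-letter `stat` command over a short-lived socket, which is exactly why each poll produces an "Accepted … / Closed … (no session established)" pair. You can reproduce one such request by hand; this assumes a ZooKeeper server on localhost:2181 and prints a notice otherwise:

```shell
# Send the same four-letter "stat" command the monitor issues every 5 seconds.
echo stat | nc -w 2 localhost 2181 2>/dev/null || echo "no ZooKeeper listening on 2181"
```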
>
>         This looks similar to ZOOKEEPER-832, if that is the issue.
>         There is one thread discussing socket connections, but it
>         does not provide much help in my case:
>         http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3CCAM1_12YvaXoe+KQ9-qCqTpv1VEGpwQvTkhn3iCTiFw6VQ7Lm0w@mail.gmail.com%3E
>
>         There are no exceptions in the tserver logs; the tablet
>         servers simply lose their locks.
>
>
>     Ah, is it possible the JVM is killing itself because GC overhead
>     is climbing too high? You can check the .out (or .err) file for
>     this error.
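The fatal error in question is the JVM's `java.lang.OutOfMemoryError: GC overhead limit exceeded`. The snippet below greps for it against a fixture file so it is self-contained; in practice you would point the grep at the real tserver `.out`/`.err` files under your Accumulo log directory:

```shell
# Fixture standing in for a tserver .out file containing the fatal GC error.
cat > /tmp/tserver_sample.out <<'EOF'
Exception in thread "tablet server" java.lang.OutOfMemoryError: GC overhead limit exceeded
EOF
# Count occurrences; any non-zero count means the JVM killed itself.
grep -c "GC overhead limit exceeded" /tmp/tserver_sample.out
# prints: 1
```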
>
>          I can scan data without any problem/exception. I need to
>         know the cause of the problem and a workaround. Would
>         upgrading resolve the issue, or does it need some
>         configuration changes?
>
>
>     Check all your system processes. I know old versions of the SNMP
>     servers would leak resources, putting memory pressure on the
>     system after a few months.  Check to see if your tserver is
>     approximately the size you need. If you aren't already doing it,
>     you will want to monitor system memory/swap usage, and see if it
>     correlates to the lost servers.  Zookeeper itself is also subject
>     to gc pauses, so they can die from the same cause, although it's a
>     much smaller process.
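For the correlation Eric suggests, even a trivial watcher that logs timestamped swap usage is enough; match its timestamps against the times tservers lose their locks. This sketch assumes Linux `free`; the three-iteration loop is just for illustration (in practice you would run it indefinitely under `nohup` or a systemd unit):

```shell
# Log swap usage (MB) once per second with a timestamp.
for i in 1 2 3; do
  printf '%s swap_used_mb=%s\n' "$(date '+%H:%M:%S')" \
    "$(free -m | awk '/^Swap:/ {print $3}')"
  sleep 1
done
```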
>
>         My current zoo.cfg is as follows.
>
>         clientPort=2181
>         syncLimit=5
>         tickTime=2000
>         initLimit=10
>         maxClientCnxn=100
>
>
>     That's all fine, but you may want to turn on the zookeeper clean-up:
>
>     http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration
>
>
>     Search for "autopurge".
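Concretely, the clean-up Eric refers to is the `autopurge` pair of settings; a minimal addition to `zoo.cfg` looks like this (comments on their own lines, since ZooKeeper's properties parser does not support inline comments):

```properties
# zoo.cfg additions: enable ZooKeeper's built-in snapshot/txn-log clean-up.
# Keep the 3 most recent snapshots; purge every 24 hours (0 disables purging).
autopurge.snapRetainCount=3
autopurge.purgeInterval=24
```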
>
>
>         I can upload full logs if anyone needs. Please do let me know
>         if you need any other info.
>
>
>     How much memory is allocated to the various processes? Do you have
>     swap turned on? Do you see the delay in the debug GC messages?
>
>     You could try turning off swap, so the OS will kill your process
>     instead of killing itself. :-)
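A non-destructive way to see the current swap state before deciding is shown below; the actual disabling steps are left as comments because they require root:

```shell
# Destructive steps (root only), shown for reference:
#   swapoff -a                 # disable swap until reboot
#   sysctl vm.swappiness=0     # discourage swapping (persist in /etc/sysctl.conf)
# Safe, read-only checks of the current swap configuration:
swapon --show 2>/dev/null || true
grep -i swaptotal /proc/meminfo
```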
>
>     -Eric
>
>
>
>
> -- 
> Josef Roehrl
> Senior Software Developer
> *PHEMI Systems*
> 180-887 Great Northern Way
> Vancouver, BC V5T 4T5
> 604-336-1119
> Website <http://www.phemi.com/> · Twitter <https://twitter.com/PHEMISystems> · LinkedIn <http://www.linkedin.com/company/3561810>