cassandra-user mailing list archives

From Julien Anguenot <jul...@anguenot.org>
Subject Re: Cassandra eats all cpu cores, high load average
Date Fri, 12 Feb 2016 15:44:46 GMT

> On Feb 12, 2016, at 9:24 AM, Skvazh Roman <r@skvazh.com> wrote:
> 
> I have disabled autocompaction and stopped it on the high-load node.

Does the load decrease, and does the node answer requests “normally”, when you disable auto-compaction?
And you actually see pending compactions on the nodes with high load, correct?

> Heap is 8 GB. gc_grace is 86400
> All SSTables are about 200-300 MB.

All seems legit here. Using G1 GC?

> $ nodetool compactionstats
> pending tasks: 14

Try increasing the concurrent compactors from 4 to 6-8 on one node, disable gossip, let it finish
compacting, then put it back in the ring by re-enabling gossip. See what happens.
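
If it helps, here is a rough sketch of that sequence in Python. It is only an illustration: the helper
names (`pending_tasks`, `drain_compactions`) and the polling interval are made up, and it assumes
`nodetool` is on the PATH and that `compactionstats` prints a `pending tasks: N` line as in your output
above. It also assumes auto-compaction is back on for the node, otherwise the pending count may never
reach zero. Raising the compactor count itself is a configuration change and is not shown.

```python
import re
import subprocess
import time

# Matches the "pending tasks: 14" line printed by `nodetool compactionstats`.
PENDING_RE = re.compile(r"^pending tasks:\s*(\d+)", re.MULTILINE)


def pending_tasks(compactionstats_output: str) -> int:
    """Return the pending-task count found in `nodetool compactionstats` output."""
    match = PENDING_RE.search(compactionstats_output)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))


def drain_compactions(poll_seconds: int = 60) -> None:
    """Hypothetical helper: leave gossip, wait for compactions to drain, rejoin."""
    subprocess.run(["nodetool", "disablegossip"], check=True)
    try:
        while True:
            result = subprocess.run(
                ["nodetool", "compactionstats"],
                check=True,
                capture_output=True,
                text=True,
            )
            if pending_tasks(result.stdout) == 0:
                break
            time.sleep(poll_seconds)
    finally:
        # Always put the node back in the ring, even if polling failed.
        subprocess.run(["nodetool", "enablegossip"], check=True)
```

Calling `drain_compactions()` on the affected node would run the whole cycle; `pending_tasks` can also
be used on its own against saved `compactionstats` output.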

The tombstone count is growing because auto-compaction is disabled on these nodes.
Probably not your issue.

   J.


> 
> $ dstat -lvnr 10
> ---load-avg--- ---procs--- ------memory-usage----- ---paging-- -dsk/total- ---system-- ----total-cpu-usage---- -net/total- --io/total-
> 1m   5m  15m |run blk new| used  buff  cach  free|  in   out | read  writ| int   csw |usr sys idl wai hiq siq| recv  send| read  writ
> 29.4 28.6 23.5|0.0   0 1.2|11.3G  190M 17.6G  407M|   0     0 |7507k 7330k|  13k   40k| 11   1  88   0   0   0|   0     0 |96.5  64.6
> 29.3 28.6 23.5| 29   0 0.9|11.3G  190M 17.6G  408M|   0     0 |   0   189k|9822  2319 | 99   0   0   0   0   0| 138k  120k|   0  4.30
> 29.4 28.6 23.6| 30   0 2.0|11.3G  190M 17.6G  408M|   0     0 |   0    26k|8689  2189 |100   0   0   0   0   0| 139k  120k|   0  2.70
> 29.4 28.7 23.6| 29   0 3.0|11.3G  190M 17.6G  408M|   0     0 |   0    20k|8722  1846 | 99   0   0   0   0   0| 136k  120k|   0  1.50 ^C
> 
> 
> JvmTop 0.8.0 alpha - 15:20:37,  amd64, 16 cpus, Linux 3.14.44-3, load avg 28.09
> http://code.google.com/p/jvmtop
> 
> PID 32505: org.apache.cassandra.service.CassandraDaemon
> ARGS:
> VMARGS: -ea -javaagent:/usr/share/cassandra/lib/jamm-0.3.0.jar -XX:+CMSCl[...]
> VM: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.8.0_65
> UP:  8:31m  #THR: 334  #THRPEAK: 437  #THRCREATED: 4694 USER: cassandra
> GC-Time:  0: 8m   #GC-Runs: 6378      #TotalLoadedClasses: 5926
> CPU: 97.96% GC:  0.00% HEAP:6049m /7540m NONHEAP:  82m /  n/a
> 
>  TID   NAME                                    STATE    CPU  TOTALCPU BLOCKEDBY
>    447 SharedPool-Worker-45                 RUNNABLE 60.47%     1.03%
>    343 SharedPool-Worker-2                  RUNNABLE 56.46%     3.07%
>    349 SharedPool-Worker-8                  RUNNABLE 56.43%     1.61%
>    456 SharedPool-Worker-25                 RUNNABLE 55.25%     1.06%
>    483 SharedPool-Worker-40                 RUNNABLE 53.06%     1.04%
>    475 SharedPool-Worker-53                 RUNNABLE 52.31%     1.03%
>    464 SharedPool-Worker-20                 RUNNABLE 52.00%     1.11%
>    577 SharedPool-Worker-71                 RUNNABLE 51.73%     1.02%
>    404 SharedPool-Worker-10                 RUNNABLE 51.10%     1.29%
>    486 SharedPool-Worker-34                 RUNNABLE 51.06%     1.03%
> Note: Only top 10 threads (according cpu load) are shown!
> 
> 
>> On 12 Feb 2016, at 18:14, Julien Anguenot <julien@anguenot.org> wrote:
>> 
>> At the time when the load is high and you have to restart, do you see any pending compactions when using `nodetool compactionstats`?
>> 
>> Possible to see a `nodetool compactionstats` taken *when* the load is too high? Have you checked the size of your SSTables for that big table? Any large ones in there? What about the Java HEAP configuration on these nodes?
>> 
>> If you have too many tombstones I would try to decrease gc_grace_seconds so they get cleared out earlier during compactions.
>> 
>>  J.
>> 
>>> On Feb 12, 2016, at 8:45 AM, Skvazh Roman <r@skvazh.com> wrote:
>>> 
>>> There are 1-4 compactions at that moment.
>>> We have many tombstones, which are not being removed.
>>> DroppableTombstoneRatio is 5-6 (greater than 1)
>>> 
>>>> On 12 Feb 2016, at 15:53, Julien Anguenot <julien@anguenot.org> wrote:
>>>> 
>>>> Hey, 
>>>> 
>>>> What about compactions count when that is happening?
>>>> 
>>>> J.
>>>> 
>>>> 
>>>>> On Feb 12, 2016, at 3:06 AM, Skvazh Roman <r@skvazh.com> wrote:
>>>>> 
>>>>> Hello!
>>>>> We have a cluster of 25 c3.4xlarge nodes (16 cores, 32 GiB), each with an attached 1.5 TB 4000 PIOPS EBS drive.
>>>>> Sometimes user CPU on one or two nodes spikes to 100% and the load average to 20-30; read requests drop off.
>>>>> Only a restart of the Cassandra service on those nodes helps.
>>>>> Please advise.
>>>>> 
>>>>> One big table with wide rows. 600 Gb per node.
>>>>> LZ4Compressor
>>>>> LeveledCompaction
>>>>> 
>>>>> concurrent compactors: 4
>>>>> compactor throughput: tried from 16 to 128
>>>>> Concurrent_readers: from 16 to 32
>>>>> Concurrent_writers: 128
>>>>> 
>>>>> 
>>>>> https://gist.github.com/rskvazh/de916327779b98a437a6
>>>>> 
>>>>> 
>>>>> JvmTop 0.8.0 alpha - 06:51:10,  amd64, 16 cpus, Linux 3.14.44-3, load avg 19.35
>>>>> http://code.google.com/p/jvmtop
>>>>> 
>>>>> Profiling PID 9256: org.apache.cassandra.service.CassandraDa
>>>>> 
>>>>> 95.73% (     4.31s) ....google.common.collect.AbstractIterator.tryToComputeN()
>>>>> 1.39% (     0.06s) com.google.common.base.Objects.hashCode()
>>>>> 1.26% (     0.06s) io.netty.channel.epoll.Native.epollWait()
>>>>> 0.85% (     0.04s) net.jpountz.lz4.LZ4JNI.LZ4_compress_limitedOutput()
>>>>> 0.46% (     0.02s) net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast()
>>>>> 0.26% (     0.01s) com.google.common.collect.Iterators$7.computeNext()
>>>>> 0.06% (     0.00s) io.netty.channel.epoll.Native.eventFdWrite()
>>>>> 
>>>>> 
>>>>> ttop:
>>>>> 
>>>>> 2016-02-12T08:20:25.605+0000 Process summary
>>>>> process cpu=1565.15%
>>>>> application cpu=1314.48% (user=1354.48% sys=-40.00%)
>>>>> other: cpu=250.67%
>>>>> heap allocation rate 146mb/s
>>>>> [000405] user=76.25% sys=-0.54% alloc=     0b/s - SharedPool-Worker-9
>>>>> [000457] user=75.54% sys=-1.26% alloc=     0b/s - SharedPool-Worker-14
>>>>> [000451] user=73.52% sys= 0.29% alloc=     0b/s - SharedPool-Worker-16
>>>>> [000311] user=76.45% sys=-2.99% alloc=     0b/s - SharedPool-Worker-4
>>>>> [000389] user=70.69% sys= 2.62% alloc=     0b/s - SharedPool-Worker-6
>>>>> [000388] user=86.95% sys=-14.28% alloc=     0b/s - SharedPool-Worker-5
>>>>> [000404] user=70.69% sys= 0.10% alloc=     0b/s - SharedPool-Worker-8
>>>>> [000390] user=72.61% sys=-1.82% alloc=     0b/s - SharedPool-Worker-7
>>>>> [000255] user=87.86% sys=-17.87% alloc=     0b/s - SharedPool-Worker-1
>>>>> [000444] user=72.21% sys=-2.30% alloc=     0b/s - SharedPool-Worker-12
>>>>> [000310] user=71.50% sys=-2.31% alloc=     0b/s - SharedPool-Worker-3
>>>>> [000445] user=69.68% sys=-0.83% alloc=     0b/s - SharedPool-Worker-13
>>>>> [000406] user=72.61% sys=-4.40% alloc=     0b/s - SharedPool-Worker-10
>>>>> [000446] user=69.78% sys=-1.65% alloc=     0b/s - SharedPool-Worker-11
>>>>> [000452] user=66.86% sys= 0.22% alloc=     0b/s - SharedPool-Worker-15
>>>>> [000256] user=69.08% sys=-2.42% alloc=     0b/s - SharedPool-Worker-2
>>>>> [004496] user=29.99% sys= 0.59% alloc=   30mb/s - CompactionExecutor:15
>>>>> [004906] user=29.49% sys= 0.74% alloc=   39mb/s - CompactionExecutor:16
>>>>> [010143] user=28.58% sys= 0.25% alloc=   26mb/s - CompactionExecutor:17
>>>>> [000785] user=27.87% sys= 0.70% alloc=   38mb/s - CompactionExecutor:12
>>>>> [012723] user= 9.09% sys= 2.46% alloc= 2977kb/s - RMI TCP Connection(2673)-127.0.0.1
>>>>> [000555] user= 5.35% sys=-0.08% alloc=  474kb/s - SharedPool-Worker-24
>>>>> [000560] user= 3.94% sys= 0.07% alloc=  434kb/s - SharedPool-Worker-22
>>>>> [000557] user= 3.94% sys=-0.17% alloc=  339kb/s - SharedPool-Worker-25
>>>>> [000447] user= 2.73% sys= 0.60% alloc=  436kb/s - SharedPool-Worker-19
>>>>> [000563] user= 3.33% sys=-0.04% alloc=  460kb/s - SharedPool-Worker-20
>>>>> [000448] user= 2.73% sys= 0.27% alloc=  414kb/s - SharedPool-Worker-21
>>>>> [000554] user= 1.72% sys= 0.70% alloc=  232kb/s - SharedPool-Worker-26
>>>>> [000558] user= 1.41% sys= 0.39% alloc=  213kb/s - SharedPool-Worker-23
>>>>> [000450] user= 1.41% sys=-0.03% alloc=  158kb/s - SharedPool-Worker-17
>>>> 
>> 
>> 
>> 
> 

--
Julien Anguenot (@anguenot)
USA +1.832.408.0344
FR +33.7.86.85.70.44

