cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11363) Blocked NTR When Connecting Causing Excessive Load
Date Wed, 04 May 2016 22:05:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15271519#comment-15271519 ]

Paulo Motta commented on CASSANDRA-11363:
-----------------------------------------

I tried to reproduce this with a variety of workloads, without success, in the following environment:
* 3 m3.xlarge nodes (4 vcpu, 15GB RAM, 2 x 40)
* 8GB Heap, CMS
* 100M keys (24GB per node)
* C* 3.0.3 with default settings and a variation with native_transport_max_threads=10
* no-vnodes

The workloads were based on a variation of https://raw.githubusercontent.com/mesosphere/cassandra-mesos/master/driver-extensions/cluster-loadtest/cqlstress-example.yaml
with 100M keys and RF=3, executed from 2 m3.xlarge stress nodes for 1, 2 and 6 hours with 10,
20 and 30 threads, without exhausting the cluster's CPU/IO capacity. I tried several combinations
of the following workloads:
* read-only
* write-only
* range-only
* triggering repairs during the execution
* unthrottling compaction

I recorded and analyzed flight recordings during the tests but didn't find anything suspicious.
No blocked native transport threads were observed during tests with the above scenarios, which
suggests that this condition is not a widespread bug like CASSANDRA-11529 but more likely
an edge-case combination of workload, environment and bad scheduling that happens in production
and is hard to reproduce with synthetic workloads.
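
One way to check for blocked native transport threads during a test run is to scan captured
nodetool tpstats output for pools with a non-zero "Blocked" or "All time blocked" counter. The
following is only a minimal sketch: TpstatsCheck and PoolStats are hypothetical names (not part
of Cassandra or nodetool), and the parser assumes the six-column layout shown in the tpstats
output quoted in the issue description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Cassandra or nodetool.
class TpstatsCheck {

    // One thread-pool row: the pool name plus the five counters printed by tpstats.
    static final class PoolStats {
        final String name;
        final long active, pending, completed, blocked, allTimeBlocked;

        PoolStats(String name, long[] counters) {
            this.name = name;
            this.active = counters[0];
            this.pending = counters[1];
            this.completed = counters[2];
            this.blocked = counters[3];
            this.allTimeBlocked = counters[4];
        }
    }

    // Parses pool rows. The header, blank lines and the dropped-message table
    // do not have "name + five numbers" and are skipped.
    static List<PoolStats> parse(String tpstatsOutput) {
        List<PoolStats> pools = new ArrayList<>();
        for (String line : tpstatsOutput.split("\\R")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 6) continue;
            long[] counters = new long[5];
            try {
                for (int i = 0; i < 5; i++) counters[i] = Long.parseLong(fields[i + 1]);
            } catch (NumberFormatException e) {
                continue; // six tokens, but not a pool row
            }
            pools.add(new PoolStats(fields[0], counters));
        }
        return pools;
    }

    // Returns pools that are blocked now or have been blocked at some point.
    static List<PoolStats> withBlockedTasks(List<PoolStats> pools) {
        List<PoolStats> result = new ArrayList<>();
        for (PoolStats p : pools) {
            if (p.blocked > 0 || p.allTimeBlocked > 0) result.add(p);
        }
        return result;
    }

    // Two rows copied from the tpstats output in the issue description.
    static String sample() {
        return String.join("\n",
            "Pool Name                    Active   Pending      Completed   Blocked  All time blocked",
            "MutationStage                     0         8        8387821         0                 0",
            "Native-Transport-Requests       128       156         387957        16            278451",
            "",
            "Message type           Dropped",
            "READ                         0");
    }

    public static void main(String[] args) {
        for (PoolStats p : withBlockedTasks(parse(sample()))) {
            System.out.println(p.name + ": blocked=" + p.blocked
                + ", all time blocked=" + p.allTimeBlocked);
        }
    }
}
```

Run against periodic tpstats captures, a check like this would flag the Native-Transport-Requests
row from the issue description while ignoring pools whose blocked counters stay at zero.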
 
A thread dump taken when this condition occurs would probably help us find where the bottleneck
or contention is, so I created CASSANDRA-11713 to add the ability to log a thread dump when the
thread pool queue is full. If someone could install that patch and enable it in production
to capture a thread dump when the blockage happens, that would probably help us work out what's
going on here.
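
The general idea of logging a thread dump when a pool's queue fills up can be sketched with
plain JDK classes. This is not the CASSANDRA-11713 patch, and Cassandra's own executors behave
differently (for example, they may block submitters rather than reject tasks). The sketch uses
java.util.concurrent.ThreadPoolExecutor with a bounded queue as a stand-in, and DumpOnFullQueue
and demo() are hypothetical names introduced only for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical handler for illustration: logs a full thread dump whenever the
// executor cannot accept a task because its workers and queue are saturated.
class DumpOnFullQueue implements RejectedExecutionHandler {
    final AtomicInteger dumps = new AtomicInteger();

    // Builds a dump with complete stack traces (ThreadInfo.toString() truncates them).
    static String threadDump() {
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());
            if (info.getLockName() != null) {
                sb.append(" waiting on ").append(info.getLockName());
            }
            sb.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("\tat ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    @Override
    public void rejectedExecution(Runnable task, ThreadPoolExecutor pool) {
        dumps.incrementAndGet();
        System.err.println("Thread pool queue is full, thread dump follows:\n" + threadDump());
        throw new RejectedExecutionException("thread pool queue is full");
    }

    // Saturates a 1-thread pool with a 1-slot queue; returns the number of dumps logged.
    static int demo() {
        DumpOnFullQueue handler = new DumpOnFullQueue();
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            1, 1, 0L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1), handler);
        CountDownLatch release = new CountDownLatch(1);
        Runnable blocker = () -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        pool.execute(blocker);     // occupies the only worker thread
        pool.execute(blocker);     // fills the only queue slot
        try {
            pool.execute(blocker); // no capacity left: the handler dumps and rejects
        } catch (RejectedExecutionException expected) {
            // rejection is the expected outcome here
        }
        release.countDown();
        pool.shutdown();
        return handler.dumps.get();
    }

    public static void main(String[] args) {
        System.out.println("dumps logged: " + demo());
    }
}
```

A production version would likely need rate limiting so that a persistently full queue does
not flood the log with dumps.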

> Blocked NTR When Connecting Causing Excessive Load
> --------------------------------------------------
>
>                 Key: CASSANDRA-11363
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11363
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Russell Bradberry
>            Assignee: Paulo Motta
>            Priority: Critical
>         Attachments: cassandra-102-cms.stack, cassandra-102-g1gc.stack
>
>
> When upgrading from 2.1.9 to 2.1.13, we are witnessing an issue where the machine load
> increases to very high levels (> 120 on an 8 core machine) and native transport requests
> get blocked in tpstats.
> I was able to reproduce this in both CMS and G1GC as well as on JVM 7 and 8.
> The issue does not seem to affect the nodes running 2.1.9.
> The issue seems to coincide with the number of connections OR the number of total requests
> being processed at a given time (as the latter increases with the former in our system).
> Currently there are between 600 and 800 client connections on each machine and each machine
> is handling roughly 2000-3000 client requests per second.
> Disabling the binary protocol fixes the issue for this node but isn't a viable option
> cluster-wide.
> Here is the output from tpstats:
> {code}
> Pool Name                    Active   Pending      Completed   Blocked  All time blocked
> MutationStage                     0         8        8387821         0                 0
> ReadStage                         0         0         355860         0                 0
> RequestResponseStage              0         7        2532457         0                 0
> ReadRepairStage                   0         0            150         0                 0
> CounterMutationStage             32       104         897560         0                 0
> MiscStage                         0         0              0         0                 0
> HintedHandoff                     0         0             65         0                 0
> GossipStage                       0         0           2338         0                 0
> CacheCleanupExecutor              0         0              0         0                 0
> InternalResponseStage             0         0              0         0                 0
> CommitLogArchiver                 0         0              0         0                 0
> CompactionExecutor                2       190            474         0                 0
> ValidationExecutor                0         0              0         0                 0
> MigrationStage                    0         0             10         0                 0
> AntiEntropyStage                  0         0              0         0                 0
> PendingRangeCalculator            0         0            310         0                 0
> Sampler                           0         0              0         0                 0
> MemtableFlushWriter               1        10             94         0                 0
> MemtablePostFlush                 1        34            257         0                 0
> MemtableReclaimMemory             0         0             94         0                 0
> Native-Transport-Requests       128       156         387957        16            278451
> Message type           Dropped
> READ                         0
> RANGE_SLICE                  0
> _TRACE                       0
> MUTATION                     0
> COUNTER_MUTATION             0
> BINARY                       0
> REQUEST_RESPONSE             0
> PAGED_RANGE                  0
> READ_REPAIR                  0
> {code}
> Attached is the jstack output for both CMS and G1GC.
> Flight recordings are here:
> https://s3.amazonaws.com/simple-logs/cassandra-102-cms.jfr
> https://s3.amazonaws.com/simple-logs/cassandra-102-g1gc.jfr
> It is interesting to note that while the flight recording was taking place, the load
> on the machine went back to healthy, and when the flight recording finished the load went
> back to > 100.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
