cassandra-commits mailing list archives

From "Corentin Chary (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13039) Mutation time mostly spent in LinkedBlockingQueue.put()
Date Tue, 27 Dec 2016 10:30:58 GMT


Corentin Chary commented on CASSANDRA-13039:

We did some additional debugging:
Most of the time *seems* to be spent in "signalNotEmpty()" (in LinkedBlockingQueue) when it's
trying to unpark one of the reader threads. Looking at the metrics, it seems that the backlog
is always empty, and that the system is doing a *lot* of context switches to keep it that way.
A workaround that seems to work is to set otc_coalescing_window_us (and otc_coalescing_strategy)
so that the backlog doesn't stay empty.
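
To illustrate why a coalescing window can help, here is a minimal Java sketch of a fixed-window coalescing consumer. It is not Cassandra's actual CoalescingStrategies code: the class name CoalescingSketch, the method takeBatch and its parameters are hypothetical. In the OpenJDK implementation, put() calls signalNotEmpty() only when the queue was empty before the insert, so a consumer that waits briefly before calling drainTo() should leave fewer puts finding an empty queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class CoalescingSketch {
    // Hypothetical helper: block for the first message, wait up to
    // windowMicros so producers can enqueue more, then drain the rest
    // in a single call (at most maxBatch messages in total).
    static <T> List<T> takeBatch(LinkedBlockingQueue<T> backlog,
                                 long windowMicros,
                                 int maxBatch) throws InterruptedException {
        List<T> batch = new ArrayList<>(maxBatch);
        batch.add(backlog.take());
        if (windowMicros > 0) {
            LockSupport.parkNanos(TimeUnit.MICROSECONDS.toNanos(windowMicros));
        }
        backlog.drainTo(batch, maxBatch - 1);
        return batch;
    }

    public static void main(String[] args) throws Exception {
        LinkedBlockingQueue<Integer> backlog = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            backlog.put(i);
        }
        List<Integer> batch = takeBatch(backlog, 200, 128);
        System.out.println(batch.size()); // prints 5
    }
}
```

The trade-off is added latency of up to one window per batch, which is why the window is a tunable rather than a fixed value.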

> Mutation time mostly spent in LinkedBlockingQueue.put()
> -------------------------------------------------------
>                 Key: CASSANDRA-13039
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Corentin Chary
>         Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps
> On a setup with a sustained write load of 70kQPS per node and a RF of 2 it looks like
> most of the mutation time is spent in OutboundTcpConnection.enqueue() -> backlog.put().
> backlog is an unbounded LinkedBlockingQueue, which means that .put() can only block
> if a lock is taken. I strongly suspect that this is caused by the use of drainTo() in
> CoalescingStrategies, which is causing contention for the producers.
> On the other hand, not using drainTo() could lead to starvation of the consumers.
> Possible solutions:
> - Allow multiple connections per size and per host in OutboundTcpConnectionPool
> - Switch from drainTo() to multiple take() calls
> - Switch to ConcurrentLinkedQueue (which is lockless); this also means we need active polling.
> Maybe a good solution would be something hybrid: a bounded LinkedBlockingQueue and an
> unbounded ConcurrentLinkedQueue. This way you get low latency when you don't have a lot of
> messages, and throughput when you do.
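
As a rough illustration of the ConcurrentLinkedQueue option listed above, the following Java sketch shows what "active polling" could look like on the consumer side. The class PollingConsumerSketch, the method pollWithBackoff and the park and timeout values are all hypothetical and are not taken from Cassandra.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

public class PollingConsumerSketch {
    // Hypothetical helper: poll a lock-free queue, parking briefly
    // between attempts while it is empty. Returns null on timeout.
    static <T> T pollWithBackoff(ConcurrentLinkedQueue<T> queue,
                                 long parkNanos,
                                 long timeoutNanos) {
        long deadline = System.nanoTime() + timeoutNanos;
        while (true) {
            T item = queue.poll();
            if (item != null) {
                return item;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null;
            }
            LockSupport.parkNanos(parkNanos);
        }
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.offer("mutation-1");
        long timeout = TimeUnit.MILLISECONDS.toNanos(10);
        System.out.println(pollWithBackoff(queue, 50_000, timeout)); // prints mutation-1
        System.out.println(pollWithBackoff(queue, 50_000, timeout)); // prints null
    }
}
```

Producers never block or signal here because offer() is lock-free. The cost moves to the consumer, which burns CPU or adds latency depending on the park interval.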

This message was sent by Atlassian JIRA
