cassandra-commits mailing list archives

From "Corentin Chary (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13039) Mutation time mostly spent in LinkedBlockingQueue.put()
Date Wed, 28 Dec 2016 08:31:58 GMT


Corentin Chary commented on CASSANDRA-13039:

Enabling otc_coalescing_window_us seems to create "backlog of death" scenarios in which most
of the time is spent in ExpireMessages() because the backlog becomes huge. The consumer is
never able to keep up with the hundreds of producers.

The new backpressure mechanism could be a solution to that but it seems too aggressive, and
isn't enabled by default.

Another issue is that multiple different things are run on Stage.MUTATION: performing the
local mutation and executing Verb.MUTATION (which will itself schedule its own local mutation
on Stage.MUTATION, and there is probably a risk of deadlock here).

A solution to that could be to run only the local mutations on Stage.MUTATION. I think this
is similar to what is done by counters.
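
The deadlock risk mentioned above can be illustrated with a minimal sketch. It does not use
Cassandra code: a plain ExecutorService stands in for Stage.MUTATION, and the class and method
names (SameStageSketch, runNested) are hypothetical. It assumes the outer task waits for the
inner one, which may or may not match what Verb.MUTATION actually does. With every thread of
the stage busy in an outer task, the nested task never gets to run.

```java
import java.util.concurrent.*;

public class SameStageSketch {
    // Submits an outer task that schedules an inner task on the SAME executor
    // and waits for it. Returns true if the inner task finished within the timeout.
    static boolean runNested(ExecutorService stage, long timeoutMs) throws Exception {
        Future<Boolean> outer = stage.submit(() -> {
            Future<?> inner = stage.submit(() -> { /* stand-in for the local mutation */ });
            try {
                inner.get(timeoutMs, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                // Without the timeout this wait would block forever.
                inner.cancel(false);
                return false;
            }
        });
        return outer.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService oneThread = Executors.newFixedThreadPool(1);
        ExecutorService twoThreads = Executors.newFixedThreadPool(2);
        try {
            // The only thread is held by the outer task, so the inner task stays queued.
            System.out.println("1 thread: " + runNested(oneThread, 200));
            // A second thread is free to run the inner task.
            System.out.println("2 threads: " + runNested(twoThreads, 200));
        } finally {
            oneThread.shutdownNow();
            twoThreads.shutdownNow();
        }
    }
}
```

Running only the local mutations on the stage, as suggested above, would remove the nested
submission altogether.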

> Mutation time mostly spent in LinkedBlockingQueue.put()
> -------------------------------------------------------
>                 Key: CASSANDRA-13039
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Coordination
>            Reporter: Corentin Chary
>         Attachments: mutation-linkedlist-block.png, profiler-snapshot.nps
> On a setup with a sustained write load of 70kQPS per node and a RF of 2 it looks like
> most of the mutation time is spent in OutboundTcpConnection.enqueue() -> backlog.put()
> backlog is an unbounded LinkedBlockingQueue, which means that .put() can only be blocking
> if a lock is taken. I strongly suspect that this is caused by the use of drainTo() in
> CoalescingStrategies, which is causing contention for the producers.
> On the other hand, not using drainTo() could lead to starvation of the consumers.
> Possible solutions:
> - Allow multiple connections per size and per host in OutboundTcpConnectionPool
> - Switch from drainTo() to multiple take()
> - Switch to ConcurrentLinkedQueue (which is lock-free); this also means we need active polling.
> Maybe a good solution would be something hybrid: a bounded LinkedBlockingQueue and an
> unbounded ConcurrentLinkedQueue. This way you get low latency when you don't have a lot of
> messages, and throughput when you do.
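
As a rough illustration of the "switch from drainTo to multiple take()" option, here is a
minimal sketch of a consumer that blocks for the first message and then collects further
messages one at a time with poll(). The class and method names (DrainSketch, drainWithPoll)
and the batch size are hypothetical, and this is not the code in CoalescingStrategies. Whether
it actually reduces producer contention would have to be measured.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class DrainSketch {
    // Blocks until one message is available, then takes up to maxItems - 1 more
    // without blocking. Each take()/poll() holds the queue's lock only briefly,
    // instead of holding it for the whole batch as a single drainTo() call does.
    static <T> int drainWithPoll(BlockingQueue<T> backlog, List<T> out, int maxItems)
            throws InterruptedException {
        out.add(backlog.take());
        int n = 1;
        while (n < maxItems) {
            T next = backlog.poll();
            if (next == null) {
                break; // backlog is empty, return what we have
            }
            out.add(next);
            n++;
        }
        return n;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> backlog = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            backlog.put("m" + i);
        }
        List<String> batch = new ArrayList<>();
        int n = drainWithPoll(backlog, batch, 3);
        // prints: 3 [m0, m1, m2] remaining=2
        System.out.println(n + " " + batch + " remaining=" + backlog.size());
    }
}
```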
