cassandra-commits mailing list archives

From "Christian Esken (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-13265) Expiration in OutboundTcpConnection can block the reader Thread
Date Fri, 10 Mar 2017 15:49:04 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15904949#comment-15904949 ]

Christian Esken edited comment on CASSANDRA-13265 at 3/10/17 3:48 PM:
----------------------------------------------------------------------

I am nearly done with the configuration, and have two questions about it:

1.  How to handle the default value? My approach is to pre-configure the default value in
Config:
{code}
    public static final int otc_backlog_expiration_interval_in_ms_default = 200;
    public volatile Integer otc_backlog_expiration_interval_in_ms = otc_backlog_expiration_interval_in_ms_default;
{code}

Additionally, I will handle null values, which might have been set via JMX, in the getter of
DatabaseDescriptor:
{code}
    public static Integer getOtcBacklogExpirationInterval()
    {
        Integer confValue = conf.otc_backlog_expiration_interval_in_ms;
        return confValue != null ? confValue : Config.otc_backlog_expiration_interval_in_ms_default;
    }
{code}
Is that OK? Should I also handle other illegal values (such as negative values) in that getter, or
reject them in the setter? I have not found a code example in Cassandra that handles bad
values uniformly for both MBean and Config.
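
To make the setter option concrete, here is a minimal sketch of rejecting bad values at write time, so the getter never has to repair them. The class and method names ({{BacklogExpirationSetting}}, {{setIntervalMs}}) are hypothetical and are not part of Cassandra; only the default of 200 ms is taken from the snippet above.

```java
// Hypothetical holder class for illustration only; not Cassandra's actual API.
final class BacklogExpirationSetting
{
    // Default taken from the Config snippet above.
    static final int DEFAULT_INTERVAL_MS = 200;

    // A primitive int cannot be null, so the getter needs no null check.
    private volatile int intervalMs = DEFAULT_INTERVAL_MS;

    int getIntervalMs()
    {
        return intervalMs;
    }

    // Rejects null and negative values before they are stored,
    // regardless of whether the caller is JMX or config loading.
    void setIntervalMs(Integer value)
    {
        if (value == null || value < 0)
            throw new IllegalArgumentException("Expiration interval must be a non-negative integer, got: " + value);
        intervalMs = value;
    }
}
```

With this approach a rejected call leaves the previous value in place, and the caller (for example a JMX client) sees the exception immediately instead of a silently substituted default.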

2. How to read the config value? I see some uses of {{Integer.getInteger(propName, defaultValue)}},
but this looks strange to me. I think changes made via JMX would not even be reflected. Thus I
am calling the getter from above: {{DatabaseDescriptor.getOtcBacklogExpirationInterval()}}.
Is the latter OK?
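
The concern can be illustrated with a small sketch. {{Integer.getInteger}} looks up a JVM system property, so it cannot see a change made to a configuration field at runtime. The class, field, and property names below are hypothetical stand-ins, and the sketch assumes the property is not set on the JVM command line.

```java
// Hypothetical demo class; names and the property key are illustrative only.
final class ConfigLookupDemo
{
    // Stand-in for a config field that a JMX setter could change at runtime.
    static volatile int intervalMs = 200;

    // Reads a JVM system property; falls back to 200 when the property is unset.
    // A change to intervalMs has no effect on this lookup.
    static int viaSystemProperty()
    {
        return Integer.getInteger("example.otc_backlog_expiration_interval_in_ms", 200);
    }

    // Reads the live field, so runtime changes are visible immediately.
    static int viaGetter()
    {
        return intervalMs;
    }
}
```

After {{intervalMs}} is changed, {{viaGetter()}} returns the new value, while {{viaSystemProperty()}} keeps returning whatever the system property (or its default) says.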



was (Author: cesken):
I am nearly done with the configuration, and have two questions about it:

1.  How to handle the default value? My approach is to pre-configure the default value in
Config:
{code}
    public static final int otc_backlog_expiration_interval_in_ms_default = 200;
    public volatile Integer otc_backlog_expiration_interval_in_ms = otc_backlog_expiration_interval_in_ms_default;
{code}

Additionally I will handle null values, that might have been set via JMX in the getter of
DatabaseDescriptor:
{code}
    public static Integer getOtcBacklogExpirationInterval()
    {
        Integer confValue = conf.otc_backlog_expiration_interval_in_ms;
        return confValue != null ? confValue : Config.otc_backlog_expiration_interval_in_ms_default;
    }
{code}

2. How to read the config value? I am seeing some {{Integer.getInteger(propName, defaultValue)}},
but this looks strange to me. I think changes from JMX would not even be reflected. Thus I
am calling the getter from above: {{DatabaseDescriptor.getOtcBacklogExpirationInterval()}}.
Is the latter OK?


> Expiration in OutboundTcpConnection can block the reader Thread
> ---------------------------------------------------------------
>
>                 Key: CASSANDRA-13265
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13265
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cassandra 3.0.9
> Java HotSpot(TM) 64-Bit Server VM version 25.112-b15 (Java version 1.8.0_112-b15)
> Linux 3.16
>            Reporter: Christian Esken
>            Assignee: Christian Esken
>         Attachments: cassandra.pb-cache4-dus.2017-02-17-19-36-26.chist.xz, cassandra.pb-cache4-dus.2017-02-17-19-36-26.td.xz
>
>
> I observed that sometimes a single node in a Cassandra cluster fails to communicate with
the other nodes. This can happen at any time, during peak load or low load. Restarting that
single node fixes the issue.
> Before going into details, I want to state that I have analyzed the situation and am
already developing a possible fix. Here is the analysis so far:
> - A thread dump in this situation showed 324 threads in the OutboundTcpConnection class
that want to lock the backlog queue for doing expiration.
> - A class histogram shows 262508 instances of OutboundTcpConnection$QueuedMessage.
> What is the effect of it? As soon as the Cassandra node has reached a certain number
of queued messages, it starts thrashing itself to death. Each of the threads fully locks the
queue for reading and writing by calling iterator.next(), making the situation worse and worse.
> - Writing: Only after 262508 locking operations can it progress with actually writing
to the queue.
> - Reading: Is also blocked, as 324 threads try to do iterator.next(), and fully lock
the queue.
> This means: Writing blocks the queue for reading, and readers might even be starved, which
makes the situation even worse.
> -----
> The setup is:
>  - 3-node cluster
>  - replication factor 2
>  - Consistency LOCAL_ONE
>  - No remote DCs
>  - high write throughput (100000 INSERT statements per second and more during peak times).
>  
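
One way to limit the contention described in the quoted report is to let at most one thread per interval walk the backlog, so that most callers skip expiration entirely. The following is only a sketch of that idea under stated assumptions and is not the actual patch: the class {{ThrottledBacklog}}, its methods, and the millisecond timestamps passed in by the caller are all hypothetical.

```java
import java.util.Iterator;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of interval-throttled expiration; not Cassandra code.
final class ThrottledBacklog
{
    static final class QueuedMessage
    {
        final long expiresAtMs;
        QueuedMessage(long expiresAtMs) { this.expiresAtMs = expiresAtMs; }
    }

    private final LinkedBlockingQueue<QueuedMessage> backlog = new LinkedBlockingQueue<>();
    // Earliest time at which the next expiration pass may run.
    private final AtomicLong nextExpirationMs = new AtomicLong(0);
    private final long intervalMs;

    ThrottledBacklog(long intervalMs) { this.intervalMs = intervalMs; }

    void enqueue(QueuedMessage message, long nowMs)
    {
        expireIfDue(nowMs);
        backlog.add(message);
    }

    // Returns the number of dropped messages, or -1 if the pass was skipped.
    int expireIfDue(long nowMs)
    {
        long due = nextExpirationMs.get();
        // Skip if the interval has not elapsed, or if another thread won the CAS
        // and is already responsible for this pass.
        if (nowMs < due || !nextExpirationMs.compareAndSet(due, nowMs + intervalMs))
            return -1;

        int dropped = 0;
        for (Iterator<QueuedMessage> it = backlog.iterator(); it.hasNext();)
        {
            if (it.next().expiresAtMs <= nowMs)
            {
                it.remove();
                dropped++;
            }
        }
        return dropped;
    }

    int size() { return backlog.size(); }
}
```

The compare-and-set ensures that, even with hundreds of writer threads, only the thread that advances {{nextExpirationMs}} iterates the queue; the others return immediately and go straight to enqueueing.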



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
