Mailing-List: contact commits-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@cassandra.apache.org
Date: Fri, 6 Jan 2017 23:54:58 +0000 (UTC)
From: "Danil Smirnov (JIRA)" <jira@apache.org>
To: commits@cassandra.apache.org
Message-ID: <JIRA.13031904.1483521227000.673596.1483746898783@Atlassian.JIRA>
In-Reply-To: <JIRA.13031904.1483521227000@Atlassian.JIRA>
References: <JIRA.13031904.1483521227000@Atlassian.JIRA> <JIRA.13031904.1483521227840@arcas>
Subject: [jira] [Updated] (CASSANDRA-13095) Timeouts between nodes
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
archived-at: Fri, 06 Jan 2017 23:55:00 -0000


     [ https://issues.apache.org/jira/browse/CASSANDRA-13095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danil Smirnov updated CASSANDRA-13095:
--------------------------------------
    Attachment: 13095-2.1.patch

> Timeouts between nodes
> ----------------------
>
>                 Key: CASSANDRA-13095
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13095
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Danil Smirnov
>            Priority: Minor
>         Attachments: 13095-2.1.patch
>
>
> Recently I ran into a problem on a heavily loaded cluster where messages between certain nodes sometimes become blocked for no apparent reason.
> It looks like the same situation that is described here: https://issues.apache.org/jira/browse/CASSANDRA-12676?focusedCommentId=15736166&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15736166
> A thread dump showed an infinite loop here: https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L109
> Apparently the problem is in the initial value of the epoch field in the TimeHorizonMovingAverageCoalescingStrategy class. When its value is not evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point to the correct bucket. As a result, sum gradually increases and, upon reaching MEASURED_INTERVAL, averageGap becomes 0 and the thread blocks.
> It's hard to reproduce because it takes a long time for sum to grow, and when no messages are sent for some time, sum becomes 0 (https://github.com/apache/cassandra/blob/a8a43dd32eb92406d7d8b105e08c68b3d5c7df49/src/java/org/apache/cassandra/utils/CoalescingStrategies.java#L301) and the bug is no longer reproducible (until the connection between the nodes is re-created).
> I've added a patch which should fix the problem. I don't know if it will be of any help, since CASSANDRA-12676 will apparently disable this behaviour. One note about performance regressions, though: there is a small chance that they are a result of the bug described here, so it might be worth testing performance after the fixes and/or tuning the algorithm.
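
The alignment issue described above can be illustrated with a small standalone sketch. It is not the Cassandra source and not the attached patch. The class name, the constant values (INDEX_SHIFT = 26, BUCKET_COUNT = 16), the bodies of ix() and averageGap(), and the alignDown() helper are all assumptions made for illustration; only the names epoch, ix, sum, averageGap, BUCKET_INTERVAL and MEASURED_INTERVAL come from the description. The sketch shows that with an unaligned epoch, ix(epoch-1) lands in the same bucket as ix(epoch) instead of the preceding one, and that an integer average gap drops to 0 once sum grows past MEASURED_INTERVAL.

```java
public class EpochAlignmentSketch {
    // Assumed values for illustration only; not taken from the ticket.
    static final int INDEX_SHIFT = 26;
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;
    static final long MEASURED_INTERVAL = BUCKET_INTERVAL * (BUCKET_COUNT - 1);

    // Assumed mapping of a timestamp to a slot in a ring of BUCKET_COUNT buckets.
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: rounds a timestamp down to a multiple of BUCKET_INTERVAL.
    static long alignDown(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // Assumed form of the average gap: integer division of the window by the sample count.
    static long averageGap(long sum) {
        if (sum == 0)
            return Integer.MAX_VALUE;
        return MEASURED_INTERVAL / sum;
    }

    public static void main(String[] args) {
        long unaligned = 5 * BUCKET_INTERVAL + 12345; // not a multiple of BUCKET_INTERVAL
        long aligned = alignDown(unaligned);

        // Unaligned: epoch and epoch-1 fall into the same bucket.
        System.out.println("unaligned: ix(epoch)=" + ix(unaligned) + " ix(epoch-1)=" + ix(unaligned - 1));
        // Aligned: epoch-1 falls into the bucket just before the epoch's bucket.
        System.out.println("aligned:   ix(epoch)=" + ix(aligned) + " ix(epoch-1)=" + ix(aligned - 1));

        // Once sum exceeds MEASURED_INTERVAL, integer division yields 0.
        System.out.println("averageGap(MEASURED_INTERVAL + 1)=" + averageGap(MEASURED_INTERVAL + 1));
    }
}
```

Under these assumptions, rounding the initial epoch down to a bucket boundary is one way to make ix(epoch-1) refer to the bucket preceding the epoch's bucket; whether the attached patch takes this approach is not stated in this message.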

