cassandra-commits mailing list archives

From "Danil Smirnov (JIRA)" <>
Subject [jira] [Updated] (CASSANDRA-13095) Timeouts between nodes
Date Wed, 04 Jan 2017 11:55:58 GMT


Danil Smirnov updated CASSANDRA-13095:
    Attachment:     (was: 13095-2.1.patch)

> Timeouts between nodes
> ----------------------
>                 Key: CASSANDRA-13095
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Danil Smirnov
>            Priority: Minor
>         Attachments: 13095-2.1.patch
> Recently I ran into a problem on a heavily loaded cluster where messages between
certain nodes sometimes become blocked for no apparent reason.
> It's very similar to the situation described here
although in my case the nodes do not seem to recover.
> A thread dump showed an infinite loop here:
> Apparently the problem is in the initial value of the epoch field in the TimeHorizonMovingAverageCoalescingStrategy
class. When its value is not evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point
to the correct bucket. As a result, sum gradually increases and, once it reaches MEASURED_INTERVAL,
averageGap becomes 0 and the thread blocks.
> It's hard to reproduce because sum is reset to 0 whenever no messages are sent for some time.
So this bug affects only heavily loaded clusters.
> I've added a patch which should fix the problem. I don't know whether it will be of any help,
since CASSANDRA-12676 will apparently disable this behaviour. One note about performance, though:
due to the incorrect averageGap calculation, TIMEHORIZON currently sleeps a lot longer than is reasonable,
so it may be worth testing performance after the fix and/or tuning the algorithm.
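
The bucket misalignment described in the report can be illustrated with a small standalone sketch. It is not the Cassandra source or the attached patch: the class name, the alignDown helper, the constant values (2^26 ns per bucket, 16 buckets) and the sum == 0 guard are assumptions chosen for illustration. It only shows that ix(epoch - 1) lands in the same bucket as ix(epoch) when epoch is not a multiple of BUCKET_INTERVAL, and that an integer averageGap drops to 0 once sum exceeds MEASURED_INTERVAL.

```java
// Illustrative sketch only: names and constant values are assumptions,
// not copied from the Cassandra source or from 13095-2.1.patch.
final class EpochAlignmentSketch {
    static final int INDEX_SHIFT = 26;                      // assumed: 2^26 ns per bucket
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;                     // assumed; must be a power of two
    static final long MEASURED_INTERVAL = BUCKET_INTERVAL * (BUCKET_COUNT - 1);

    // Map a nanosecond timestamp to the index of the bucket that contains it.
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: round a timestamp down to the start of its bucket.
    static long alignDown(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // Integer division, so the result is 0 once sum exceeds MEASURED_INTERVAL.
    // The sum == 0 guard is an assumption that avoids division by zero.
    static long averageGap(long sum) {
        if (sum == 0) {
            return Integer.MAX_VALUE;
        }
        return MEASURED_INTERVAL / sum;
    }

    public static void main(String[] args) {
        // Stands in for an arbitrary clock reading that is not bucket-aligned.
        long unaligned = 5 * BUCKET_INTERVAL + 12345;
        long aligned = alignDown(unaligned);

        // Unaligned epoch: epoch and epoch - 1 fall in the same bucket.
        System.out.println(ix(unaligned) + " " + ix(unaligned - 1)); // prints "5 5"
        // Aligned epoch: epoch - 1 falls in the preceding bucket.
        System.out.println(ix(aligned) + " " + ix(aligned - 1));     // prints "5 4"
        // Once sum exceeds MEASURED_INTERVAL the average gap is 0.
        System.out.println(averageGap(MEASURED_INTERVAL + 1));       // prints "0"
    }
}
```

Under these assumptions, rounding the initial epoch down to a bucket boundary is enough to make ix(epoch - 1) refer to the bucket immediately before ix(epoch); whether the attached patch takes this approach is not stated in the message.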

This message was sent by Atlassian JIRA
