cassandra-commits mailing list archives

From "Danil Smirnov (JIRA)" <>
Subject [jira] [Created] (CASSANDRA-13095) Timeouts between nodes
Date Wed, 04 Jan 2017 09:13:58 GMT
Danil Smirnov created CASSANDRA-13095:

             Summary: Timeouts between nodes
                 Key: CASSANDRA-13095
             Project: Cassandra
          Issue Type: Bug
            Reporter: Danil Smirnov
            Priority: Minor

Recently I've run into a problem with a heavily loaded cluster where messages between
certain nodes sometimes become blocked for no apparent reason.

It's very similar to the situation described here
although in my case the nodes do not seem to recover.

A thread dump showed an infinite loop here:

Apparently the problem is the initial value of the epoch field in the TimeHorizonMovingAverageCoalescingStrategy
class. When its value is not evenly divisible by BUCKET_INTERVAL, ix(epoch-1) does not point
to the correct bucket. As a result, sum gradually increases and, upon reaching MEASURED_INTERVAL,
averageGap becomes 0 and the thread blocks.
It's hard to reproduce because when no messages are sent for some time, sum becomes 0.
So this bug affects only heavily loaded clusters.
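
To illustrate the index mismatch described above, here is a minimal Java sketch. It is not copied from the Cassandra source: the constant values, the shift-and-mask form of ix, the sum == 0 guard in averageGap, and the alignDown helper are all assumptions made for the example. alignDown is only one hypothetical way to put epoch on a bucket boundary and is not necessarily what the attached patch does.

```java
// Illustrative sketch only: constants and helper names are assumptions,
// not taken from the Cassandra source.
final class EpochBucketSketch {
    static final int INDEX_SHIFT = 26;                      // assumed: 2^26 ns (~67 ms) per bucket
    static final long BUCKET_INTERVAL = 1L << INDEX_SHIFT;
    static final int BUCKET_COUNT = 16;                     // assumed ring size (power of two)
    static final long MEASURED_INTERVAL = BUCKET_INTERVAL * (BUCKET_COUNT - 1);

    // Ring-buffer index of the bucket that contains the given timestamp.
    static int ix(long nanos) {
        return (int) ((nanos >>> INDEX_SHIFT) & (BUCKET_COUNT - 1));
    }

    // Hypothetical helper: round a timestamp down to a bucket boundary.
    static long alignDown(long nanos) {
        return nanos & ~(BUCKET_INTERVAL - 1);
    }

    // Integer division: the result is 0 once sum exceeds MEASURED_INTERVAL.
    // The sum == 0 guard is an assumption to avoid dividing by zero.
    static long averageGap(long sum) {
        return sum == 0 ? Integer.MAX_VALUE : MEASURED_INTERVAL / sum;
    }

    public static void main(String[] args) {
        long aligned = 5 * BUCKET_INTERVAL;   // epoch on a bucket boundary
        long unaligned = aligned + 12345;     // epoch inside a bucket

        // Aligned epoch: epoch-1 falls in the previous bucket.
        System.out.println(ix(aligned) + " vs " + ix(aligned - 1));      // 5 vs 4
        // Unaligned epoch: epoch-1 falls in the same bucket as epoch.
        System.out.println(ix(unaligned) + " vs " + ix(unaligned - 1));  // 5 vs 5
        // A sum that has drifted past MEASURED_INTERVAL yields a zero gap.
        System.out.println(averageGap(MEASURED_INTERVAL + 1));           // 0
    }
}
```

With an aligned epoch, ix(epoch-1) names the bucket just before ix(epoch); with an unaligned epoch both expressions name the same bucket, which is the mismatch the report describes.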

I've added a patch which should fix the problem. I don't know if it will be of any help, since
CASSANDRA-12676 will apparently disable this behaviour. One note about performance, though:
due to the incorrect averageGap calculation, TIMEHORIZON currently sleeps a lot more than is reasonable,
so it might be worth testing performance after the fix and/or tuning the algorithm.

