cassandra-commits mailing list archives

From "Ryan Fowler (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-6244) calculatePendingRanges could be asynchronous on 1.2 too
Date Mon, 28 Oct 2013 22:44:31 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-6244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Fowler updated CASSANDRA-6244:
-----------------------------------

    Attachment: escalating-phi.txt

I think calculatePendingRanges is probably "fast enough" when the number of keyspaces is small.

I've attached a log file that I think demonstrates what's going on for us. In a 3-node cluster,
I created 30 keyspaces (no column families). I then decommissioned 172.17.0.28 and attached the
(grepped) debug logs from 172.17.0.26.

What you see is "Pending ranges:" lines from calculatePendingRanges interleaved with
"Sending a GossipDigestSyn" and "PHI for" messages.

The PHI values for both .27 and .28 rise because no incoming gossip is being processed. .27 and
.28 both get marked DOWN, but the calculatePendingRanges calls continue for a while. Eventually
things come back together, but clients get UnavailableExceptions until they do.

I used Docker to reproduce the problem, but our EC2 infrastructure is seeing essentially the
same thing (with slightly fewer keyspaces).


> calculatePendingRanges could be asynchronous on 1.2 too
> -------------------------------------------------------
>
>                 Key: CASSANDRA-6244
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6244
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>         Environment: Cassandra 1.2, AWS
>            Reporter: Ryan Fowler
>             Fix For: 1.2.12
>
>         Attachments: 6244.txt, escalating-phi.txt
>
>
> calculatePendingRanges can hang up the Gossip thread to the point of a node marking all
> the other nodes down.
> I noticed that the same problem was resolved with CASSANDRA-5135, so I attempted to port
> the patch from that issue to the 1.2 codebase.
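
The general idea of taking the calculation off the Gossip thread can be sketched as below.
This is only an illustration of handing the work to a dedicated single-threaded executor; it
is not the attached 6244.txt patch or the CASSANDRA-5135 change. The class
PendingRangeOffloadSketch, its methods, and the 5 ms simulated cost per keyspace are all
hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only; names and timings are assumptions.
class PendingRangeOffloadSketch {
    // One worker thread keeps the calculations ordered, but off the caller's thread.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    final AtomicInteger completed = new AtomicInteger();

    // Called from the (simulated) gossip thread: queues the work and returns immediately.
    Future<?> update(String keyspace) {
        return executor.submit(() -> {
            slowCalculatePendingRanges(keyspace);
            completed.incrementAndGet();
        });
    }

    // Stand-in for the expensive per-keyspace calculation.
    private void slowCalculatePendingRanges(String keyspace) {
        try {
            Thread.sleep(5);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Stops accepting work and waits for queued calculations; true if all finished in time.
    boolean shutdownAndAwait(long timeoutSeconds) {
        executor.shutdown();
        try {
            return executor.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        PendingRangeOffloadSketch sketch = new PendingRangeOffloadSketch();
        long start = System.nanoTime();
        for (int i = 0; i < 30; i++) {
            sketch.update("keyspace" + i);
        }
        long submitMicros = (System.nanoTime() - start) / 1000;
        System.out.println("submitted 30 updates in " + submitMicros + " us");
        System.out.println("all finished: " + sketch.shutdownAndAwait(10));
    }
}
```

The caller only pays the cost of queueing each task, so it stays free to process incoming
gossip while the calculations drain in the background.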



--
This message was sent by Atlassian JIRA
(v6.1#6144)
