cassandra-commits mailing list archives

From "Jeffrey F. Lukman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
Date Mon, 09 May 2016 21:45:13 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277100#comment-15277100
] 

Jeffrey F. Lukman commented on CASSANDRA-11724:
-----------------------------------------------

[~jeromatron]: Okay, I will try this again and report back on whether this config
produces a different result.

For now, can you help me by confirming whether you also see the Workload-4 bug?
Workload-4: run a 512-node cluster with some data, then decommission one node.
In our setup, we see a high number of false failure detections.

> False Failure Detection in Big Cassandra Cluster
> ------------------------------------------------
>
>                 Key: CASSANDRA-11724
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jeffrey F. Lukman
>              Labels: gossip, node-failure
>         Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg, experiment-result.txt
>
>
> We are running some tests on Cassandra v2.2.5 stable in a big cluster. In our setup,
> each machine has 16 cores and runs 8 Cassandra instances, and we test with
> 32, 64, 128, 256, and 512 instances of Cassandra. We use the default number of vnodes for
> each instance, which is 256. The data and log directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it reaches a stable condition, and
> start the other half of the cluster
> Workload3: Start half of the cluster, wait until it reaches a stable condition, load
> some data, and start the other half of the cluster
> Workload4: Start the cluster, wait until it reaches a stable condition, load some data,
> and decommission one node
> For this testing, we measure the total number of false failure detections inside the
> cluster. By false failure detection, we mean that, for example, instance-1 marks instance-2
> as down, but instance-2 is not down. We dug deeper into the root cause and found that
> instance-1 had not received any heartbeat from instance-2 for some time because instance-2
> was running a long computation.
> Here I attach the graphs of each workload result.
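
The false failure detection metric described in the issue can be illustrated with a minimal sketch. This is not Cassandra's code: Cassandra uses a phi accrual failure detector, while the sketch below substitutes a simple fixed timeout, and the names Observation, suspected_down, and count_false_detections, as well as the sample timings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    observer: str          # instance doing the marking
    target: str            # instance being judged
    last_heartbeat: float  # time (seconds) the observer last heard from target


def suspected_down(obs, now, timeout):
    """Fixed-timeout stand-in: suspect the target once it has been silent too long."""
    return now - obs.last_heartbeat > timeout


def count_false_detections(observations, truly_down, now, timeout):
    """Count (observer, target) pairs marked down while the target is actually alive."""
    return sum(
        1
        for obs in observations
        if suspected_down(obs, now, timeout) and obs.target not in truly_down
    )


# Hypothetical sample data, not taken from the experiment results.
observations = [
    # instance-2 is busy with a long computation and has been silent for 30 s
    Observation("instance-1", "instance-2", last_heartbeat=70.0),
    # instance-3 sent a heartbeat 1 s ago
    Observation("instance-1", "instance-3", last_heartbeat=99.0),
    # instance-4 really is down and has been silent for 40 s
    Observation("instance-3", "instance-4", last_heartbeat=60.0),
]

false_count = count_false_detections(
    observations, truly_down={"instance-4"}, now=100.0, timeout=20.0
)
print(false_count)
```

Only the instance-1/instance-2 pair counts as a false detection here: instance-2 is alive but was silent longer than the timeout, which mirrors the root cause reported above.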



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
