cassandra-commits mailing list archives

From "Jeffrey F. Lukman (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-11724) False Failure Detection in Big Cassandra Cluster
Date Sat, 07 May 2016 21:18:12 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-11724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15275388#comment-15275388 ]

Jeffrey F. Lukman edited comment on CASSANDRA-11724 at 5/7/16 9:17 PM:
-----------------------------------------------------------------------

{quote}
Are you not waiting two minutes between starting each node to join the ring?
{quote}

Hi Jeremy,

No, I have not been waiting 2 minutes between node starts. Are you saying that even when I
bootstrap a new cluster, say a 512-node cluster, I have to add the nodes one by one and
wait 2 minutes between each addition? That would mean that bootstrapping a 512-node cluster
takes 511 * 2 minutes = 1022 minutes =
*17 hours*?
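
The estimate above can be checked with a short calculation. This is only a sketch: the function name `staggered_start_minutes` is made up for illustration, and the 2-minute gap is the figure taken from the quoted question, not a value read from any Cassandra setting.

```python
def staggered_start_minutes(nodes: int, gap_minutes: float = 2.0) -> float:
    """Total waiting time when nodes are started one by one.

    The first node starts immediately, so only (nodes - 1) gaps are waited.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return (nodes - 1) * gap_minutes


# Cluster sizes used in the experiments described in this issue.
for n in (32, 64, 128, 256, 512):
    minutes = staggered_start_minutes(n)
    print(f"{n:>3} nodes: {minutes:.0f} minutes ({minutes / 60:.1f} hours)")
```

The waiting time grows linearly with the cluster size, which is why the one-by-one approach becomes costly at 512 nodes.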

What about the problem with decommissioning a single node?
In our 4th workload, we first started X nodes, waited until the cluster was stable, and
then decommissioned one node.
We still saw a large number of false failure detections. For 512 nodes, we measured
*90,000+* false failure detections.


was (Author: jeffreyflukman):
{quote}
Are you not waiting two minutes between starting each node to join the ring?
{quote}

Hi Jeremy,

No, I have not been waiting 2 minutes between node starts. Are you saying that even when I
bootstrap a new cluster, say a 512-node cluster, I have to add the nodes one by one and
wait 2 minutes between each addition? That would mean that bootstrapping a 512-node cluster
takes 511 * 2 minutes = 1022 minutes =
*17+ hours*?

What about the problem with decommissioning a single node?
In our 4th workload, we first started X nodes, waited until the cluster was stable, and
then decommissioned one node.
We still saw a large number of false failure detections. For 512 nodes, we measured
*90,000+* false failure detections.

> False Failure Detection in Big Cassandra Cluster
> ------------------------------------------------
>
>                 Key: CASSANDRA-11724
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11724
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jeffrey F. Lukman
>              Labels: gossip, node-failure
>         Attachments: Workload1.jpg, Workload2.jpg, Workload3.jpg, Workload4.jpg, experiment-result.txt
>
>
> We are running tests on Cassandra v2.2.5 (stable) in a large cluster. In our setup,
each machine has 16 cores and runs 8 Cassandra instances, and we test with
32, 64, 128, 256, and 512 Cassandra instances. Each instance uses the default number of
vnodes, which is 256. The data and log directories are on an in-memory tmpfs file system.
> We run several types of workloads on this Cassandra cluster:
> Workload1: Just start the cluster
> Workload2: Start half of the cluster, wait until it reaches a stable state, then
start the other half of the cluster
> Workload3: Start half of the cluster, wait until it reaches a stable state, load
some data, then start the other half of the cluster
> Workload4: Start the cluster, wait until it reaches a stable state, load some data,
then decommission one node
> In these tests, we measure the total number of false failure detections inside the
cluster. By false failure detection we mean that, for example, instance-1 marks instance-2
down although instance-2 is not down. We dug deeper into the root cause and found that
instance-1 had not received any heartbeat from instance-2 for some time because instance-2
was running a long computation.
> Here I attach the graphs of each workload result.
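
To make the definition of a false failure detection concrete, here is a minimal sketch. It assumes a simple fixed-timeout detector, which is a stand-in chosen for illustration and not Cassandra's actual failure detector; the function name, the timeout, and the heartbeat times are all made up. It counts how many times an observer would mark a peer down while that peer was alive but late in sending heartbeats.

```python
from typing import List


def count_false_detections(heartbeat_times: List[float], timeout: float) -> int:
    """Count gaps between consecutive heartbeats that exceed `timeout`.

    Every heartbeat in the list was really sent, so the sender was alive the
    whole time. Each over-long gap is therefore one false failure detection:
    the observer marks the peer down, then sees it come back.
    """
    false_detections = 0
    for previous, current in zip(heartbeat_times, heartbeat_times[1:]):
        if current - previous > timeout:
            false_detections += 1
    return false_detections


# Hypothetical arrival times (seconds) of heartbeats from instance-2 at
# instance-1. instance-2 is busy with a long computation between t=3 and t=30.
arrivals = [0.0, 1.0, 2.0, 3.0, 30.0, 31.0, 32.0]
print(count_false_detections(arrivals, timeout=10.0))
```

With a longer timeout the same pause would not be counted, at the cost of detecting real failures more slowly.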



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
