cassandra-commits mailing list archives

From "Darla Baker (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-6127) vnodes don't scale to hundreds of nodes
Date Tue, 01 Oct 2013 23:41:24 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-6127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13783481#comment-13783481
] 

Darla Baker commented on CASSANDRA-6127:
----------------------------------------

Per Jonathan's request, I'm adding an update here regarding eBay's experience on https://support.datastax.com/tickets/6928,
which was the result of the first stage of executing the plan from https://support.datastax.com/requests/6636.

They had an existing 32-node DSE 3.1.0 cluster in their PHX data center.  Their plan was
to add a second data center to the cluster in SLC with 50 nodes and vnodes enabled.  They
were to begin with bringing all nodes up with auto bootstrapping turned off to prevent any
data streaming until they were ready to make other changes to bring the data center fully
online.

Essentially immediately upon bringing the nodes up in SLC, the nodes in PHX began reporting
as down and he began receiving SMS messages and calls from application engineers that the
application which uses that cluster was down.

As we were in triage mode, the most expedient course of action was to shut down the SLC nodes
and remove them from gossip.  Upon trying to execute the nodetool removenode command, we hit
CASSANDRA-5857, although up to this point we had thought that nodetool decommission was
responsible for the issue.  In any case, we started executing the workaround described in that
ticket.  At the point we parted, the process was going slowly, but he reported that it was
working: the nodes were disappearing from the ring and the application engineers were
reporting that the application was back online.

At some point during the weekend, Alex reached out to Jeremy, who was on call, and Jeremy
was finally able to get the nodes removed from gossip, fully stabilize the 32-node PHX
data center, and fully decommission the SLC data center.

Alex attached some logs to the ticket during the event.  We were seeing node flapping and
NPEs throughout.

Ticket https://support.datastax.com/tickets/6917 contains some additional details on the test
cases.

Ticket https://support.datastax.com/tickets/6939 contains the alternate plan that eBay is
considering in light of the difficulties encountered with bringing SLC online.

> vnodes don't scale to hundreds of nodes
> ---------------------------------------
>
>                 Key: CASSANDRA-6127
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6127
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Any cluster that has vnodes and consists of hundreds of physical nodes.
>            Reporter: Tupshin Harper
>            Assignee: Jonathan Ellis
>
> There are a lot of gossip-related issues related to very wide clusters that also have
> vnodes enabled. Let's use this ticket as a master in case there are sub-tickets.
> The most obvious symptom I've seen is with 1000 nodes in EC2 with m1.xlarge instances.
> Each node configured with 32 vnodes.
> Without vnodes, cluster spins up fine and is ready to handle requests within 30 minutes
> or less.
> With vnodes, nodes are reporting constant up/down flapping messages with no external
> load on the cluster. After a couple of hours, they were still flapping, had very high cpu
> load, and the cluster never looked like it was going to stabilize or be useful for traffic.
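
For a rough sense of scale in the description quoted above: the number of tokens in the ring
is the node count multiplied by the tokens per node. The sketch below only does that
arithmetic with the figures from the ticket (1000 nodes, 32 vnodes each). The function name is
hypothetical, and nothing here models how Cassandra's gossip actually behaves.

```python
def ring_token_count(nodes: int, tokens_per_node: int) -> int:
    """Total tokens in the ring for a given node count and tokens per node."""
    return nodes * tokens_per_node


# Figures taken from the ticket description: 1000 nodes, 32 vnodes per node.
without_vnodes = ring_token_count(1000, 1)   # one token per node
with_vnodes = ring_token_count(1000, 32)     # 32 tokens per node

# Ratio of ring entries with vnodes versus without.
growth_factor = with_vnodes // without_vnodes

print(without_vnodes, with_vnodes, growth_factor)
```

With these inputs the ring holds 32 times as many tokens as the same cluster without vnodes.
Whether that alone explains the flapping is not established in this ticket.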



--
This message was sent by Atlassian JIRA
(v6.1#6144)
