cassandra-user mailing list archives

From Eric Stevens <>
Subject Re: setting up prod cluster
Date Mon, 12 Jan 2015 14:27:34 GMT
Hi Tim, replies inline below.

On Sun, Jan 11, 2015 at 8:03 PM, Tim Dunphy <> wrote:

> Hey all,
>  I've been experimenting with Cassandra on a small scale and in my own
> sandbox for a while now. I'm pretty used to working with it to get small
> clusters up and running and gossiping with each other.
> But I just had a new project at work drop into my lap that requires a
> NoSQL data store. And the developers have selected... you guessed it!
> Cassandra as their back end database.
> So I'll be asked to set up a 6 node cluster all hosted in one data center.
> I want to just make sure that I understand the concept of seeds correctly.
> I think since we'll be dealing with 6 nodes, what I'll want to do is have 2
> seeds. And have each seed seeing each other as its own seed.
There isn't really a reason to have a seed host exclude itself from its own
seeds list.  All hosts in a cluster can share a common set of seeds.  A
typical configuration is to select three hosts from each data center,
preferably from three different racks (or AWS availability zones).  A new
host would then have trouble coming online only if all three seeds were
offline at the same time.  If a host which is coming online can talk to
even one seed, it will query that seed to find the rest of the nodes in
the cluster.
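
As a rough sketch of the rack-aware seed selection described above: the
host list, IP addresses, rack names, and the pick_seeds helper are all made
up for illustration and are not part of Cassandra.  It picks up to three
hosts per data center, taking one host per rack first.

```python
def pick_seeds(hosts, per_dc=3):
    """Pick up to `per_dc` seed IPs per data center, spreading across racks.

    `hosts` is a list of (ip, datacenter, rack) tuples (hypothetical layout).
    Returns a dict mapping datacenter name -> list of seed IPs.
    """
    seeds = {}
    for dc in sorted({h[1] for h in hosts}):
        dc_hosts = [h for h in hosts if h[1] == dc]
        chosen = []
        used_racks = set()
        # First pass: take at most one host from each rack.
        for ip, _, rack in dc_hosts:
            if len(chosen) < per_dc and rack not in used_racks:
                chosen.append(ip)
                used_racks.add(rack)
        # Second pass: if there are fewer racks than seeds wanted, fill up
        # with any remaining hosts.
        for ip, _, _ in dc_hosts:
            if len(chosen) >= per_dc:
                break
            if ip not in chosen:
                chosen.append(ip)
        seeds[dc] = chosen
    return seeds


# Six hypothetical hosts in one data center, two per rack.
hosts = [
    ("10.0.0.1", "dc1", "r1"), ("10.0.0.2", "dc1", "r1"),
    ("10.0.0.3", "dc1", "r2"), ("10.0.0.4", "dc1", "r2"),
    ("10.0.0.5", "dc1", "r3"), ("10.0.0.6", "dc1", "r3"),
]
seeds = pick_seeds(hosts)
# Comma-separated string, the form the seed list takes in the config file.
seed_string = ",".join(ip for dc in sorted(seeds) for ip in seeds[dc])
print(seed_string)
```

Every host in the cluster would then be given the same comma-separated
seed string.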

The one thing you *don't* want to do is have a host be in its own seeds
list when joining a cluster with existing data.  That's a hint that the
host should consider itself authoritative on what data it already owns, and
it will keep that host from bootstrapping: it'll join the cluster
immediately without learning anything about the data it's now responsible
for.

> Then the other 2 nodes in each sub-group will have the IP for its seed in
> each of its cassandra.yaml files.
I'm not really sure what you mean by sub-group here.  If all six hosts are
in the same datacenter, do you maybe mean you're spreading the hosts out
across several physical racks (or AWS availability zones)?  There might be
some confusion here.  Most if not all hosts in your cluster would typically
share the same seeds list.

> Then I'll want to set the replication factor to 5. Since it'll be the
> total number of nodes -1. I just want to make sure I have all that right.
RF=5 isn't necessarily *wrong*, but I have a feeling it's not what you
want.  RF doesn't usually depend on how many nodes are in your cluster; it
represents your fault tolerance.

Replication Factor says how many times a single piece of data ("piece" as
determined by partition key in the table) is written to your cluster inside
of a given datacenter, with each copy going to a different physical host,
and preferring to place replicas in different physical racks if it's
possible.  With RF=5, you can totally lose four nodes and still be able to
access all your data (albeit at a read/write consistency level of ONE).
You can simultaneously lose two nodes, and most clients (which tend to
prefer consistency level of quorum by default; a quorum of 5 replicas is 3)
wouldn't even notice.  A more common RF is 3, regardless of cluster size.
This lets you totally lose two nodes at the same time and not lose any
data, though quorum reads and writes at RF=3 tolerate only one lost
replica.
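
The arithmetic behind those numbers can be sketched as follows.  This is
only an illustration: the function names are made up, and it counts how
many of a partition's replicas can be down while an operation at a given
consistency level still succeeds.

```python
def quorum(rf):
    """Replicas needed for a QUORUM operation: a strict majority of RF."""
    return rf // 2 + 1


def tolerated_failures(rf, consistency):
    """How many replicas of a partition can be down at this level."""
    required = {"ONE": 1, "QUORUM": quorum(rf), "ALL": rf}[consistency]
    return rf - required


for rf in (3, 5):
    print("RF=%d: ONE tolerates %d down, QUORUM tolerates %d down" % (
        rf, tolerated_failures(rf, "ONE"), tolerated_failures(rf, "QUORUM")))
# RF=3: ONE tolerates 2 down, QUORUM tolerates 1 down
# RF=5: ONE tolerates 4 down, QUORUM tolerates 2 down
```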

> Another thing that will have to happen is that I will need to connect
> Cassandra into a 4 node ElasticSearch cluster. I think there are a few
> options for doing that. I've seen names like Titan and Gremlin. And I was
> wondering if anyone has any recommendations there.
I have no first-hand experience on that front, but depending on your
budget, DataStax Enterprise's integrated Solr might be a better fit (it'll
be a lot less work and time).

> And lastly I'd like to point out that I know literally nothing about the
> data that will be stored there just as of yet. The first meeting about the
> project will be tomorrow. My manager gave me an advance heads-up about
> what will be required.
If this is your first Cassandra project, you should understand that
effective data modeling for Cassandra focuses very, very heavily on knowing
exactly what queries will be performed against the data.  CQL looks like
SQL, but ad hoc querying isn't practical, and typically you'll write the
same business data multiple times in multiple layouts (tables with
different partition/clustering keys), once to satisfy each specific query.
For some of my business data, I write exactly the same data to 6 to 8
tables so I can answer different classes of question.
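
To make the write-it-once-per-query idea concrete, here is a toy in-memory
sketch.  It is not Cassandra code: the dictionaries stand in for tables
keyed by partition key, clustering order is ignored, and the order fields
and table names are invented for illustration.

```python
from collections import defaultdict

# Each dict stands in for one table: partition key -> rows in that partition.
orders_by_customer = defaultdict(list)  # answers "orders for customer X"
orders_by_day = defaultdict(list)       # answers "orders placed on day Y"


def record_order(order):
    """Write the same business record once per query-specific layout."""
    orders_by_customer[order["customer_id"]].append(order)
    orders_by_day[order["day"]].append(order)


record_order({"order_id": 1, "customer_id": "c42", "day": "2015-01-12"})
record_order({"order_id": 2, "customer_id": "c42", "day": "2015-01-13"})
record_order({"order_id": 3, "customer_id": "c7", "day": "2015-01-12"})

# Each query reads exactly one partition of the table built for it.
customer_order_ids = [o["order_id"] for o in orders_by_customer["c42"]]
day_order_ids = [o["order_id"] for o in orders_by_day["2015-01-12"]]
print(customer_order_ids, day_order_ids)
```

The cost is extra writes and storage; the benefit is that every supported
query is a single-partition lookup.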

> Thank you,
> Tim
> --
> GPG me!!
> gpg --keyserver --recv-keys F186197B
