cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From William Oberman <>
Subject Re: advice for EC2 deployment
Date Wed, 27 Apr 2011 13:06:49 GMT
It's great advice, but I'm still torn.  I've never done multi-region work
before, and I'd prefer to wait for 0.8 with built-in inter-node security,
but I'm otherwise ready to roll (and need to roll) cassandra out sooner than

Given how well my system held up with a total single AZ failure, I'm really
leaning on starting by treating AZ's as DCs, and racks as... random?  I
don't think that part matters.  My question for today is to just use the
property file snitch, or to roll my own version of Ec2Snith that does AZ as

I do increase my risk being single region to start, so I was going to figure
out how to push snapshots to S3.  One question on that note: is it better to
try and snapshot all nodes at roughly the same point in time, or is it
better to do "rolling snapshots"?


On Wed, Apr 27, 2011 at 7:13 AM, aaron morton <>wrote:

> Using the EC2Snitch you could have one AZ in us-east-1 and one Az in
> us-west-1, treat each AZ as a single rack and each region as a DC. The
> network topology is rack aware so will prefer request that go to the same
> rack (not much of an issue when you have only one rack).
> If possible I would use the same RF in each DC, if you want the fail over
> to be as clean as possible (see earlier comments about when number failed
> nodes in a dc). i.e. 3 replicas in each dc / region.
> Until you find a reason otherwise use LOCAL_QUORUM, if that proves to be
> too slow or you get more experience and feel comfortable with the trade offs
> then change to a lower CL.
> Dropping the CL level for write bursts does not make the cluster run any
> faster, it lets the client think the cluster is running faster and can
> result in the client overloading (in a good "this is what it should do" way)
> the cluster. This can result in more "eventual consistency" work to be done
> later during maintenance or read requests. If that is a reasonable trade
> off, you can write at CL ONE and read at CL ALL to ensure you get consistent
> reads (quorum is not good enough in that case).
> Jump in and test it at Quorum, you may find the write performance is good
> enough. There are lots of dials to play with
> Hope that helps.
> Aaron
> On 27 Apr 2011, at 09:31, William Oberman wrote:
> I see what you're saying.  I was able to control write latency on mysql
> using insert vs insert delayed (what I feel is MySQLs poor man's eventual
> consistency option) + the fact that replication was a background
> asynchronous process.  In terms of read latency, I was able to do up to a
> few hundred well indexed mysql queries (across AZs) on a view while keeping
> the overall latency of the page around or less than a second.
> I basically am replacing two use cases, the cases with difficult to scale
> anticipated write volumes.  The first case was previously using insert
> delayed (which I'm doing in cassandra as ONE) as I wasn't getting consistent
> write/read operations before anyways.  The second case was using traditional
> insert (which I was going to replace with some QUORUM-like level, I was
> assuming LOCAL_QUORUM).  But, the latter case uses a write through memory
> cache (memcache), so I don't know how often it really reads data from the
> persistent store.  But I definitely need to make sure it is consistent.
> In any case, it sounds like I'd be best served treating AZs as DCs, but
> then I don't know what to make racks?  Or do racks not matter in a single
> AZ?  That way I can get an ack from a LOCAL_QUORUM read/write before the
> (slightly) slower read/write to/from the other AZ (for redundancy).  Then
> I'm only screwed if Amazon has a multi-AZ failure (so far, they've kept it
> to "only" one!) :-)
> will
> On Tue, Apr 26, 2011 at 5:01 PM, aaron morton <>wrote:
>> One difference between Cassandra and MySQL replication may be when the
>> network IO happens. Was the MySQL replication synchronous on transaction
>> commit ?  I was only aware that it had async replication, which means the
>> client is not exposed to the network latency. In cassandra the network
>> latency is exposed to the client as it needs to wait for the CL number of
>> nodes to respond.
>> If you use the PropertyFilePartitioner with the NetworkTopology you can
>> manually assign machines to racks / dc's based on IP.
>> See conf/ file there is also an Ec2Snitch which
>> (from the code)
>> /**
>>  * A snitch that assumes an EC2 region is a DC and an EC2
>> availability_zone
>>  *  is a rack. This information is available in the config for the node.
>> Recent discussion on DC aware CL levels
>> Hope that helps.
>>  <>
>> Aaron
>> On 27 Apr 2011, at 01:18, William Oberman wrote:
>> Thanks Aaron!
>> Unless no one on this list uses EC2, there were a few minor troubles end
>> of last week through the weekend which taught me a lot about obscure failure
>> modes in various applications I use :-)  My original post was trying to be
>> more redundant than fast, which has been by overall goal from even before
>> moving to Cassandra (my downtime from the EC2 madness was minimal, and due
>> to only having one single point of failure == the amazon load balancer).  My
>> secondary goal was  trying to make moving to a second region easier, but is
>> that is causing problems I can drop the idea.
>> I might be downplaying the cost of inter-AZ communication, but I've lived
>> with that for quite some time, for example my current setup of MySQL in
>> Master-Master replication is split over zones, and my webservers live in yet
>> different zones.  Maybe Cassandra is "chattier" than I'm used to?  (again,
>> I'm fairly new to cassandra)
>> Based on that article, the discussion, and the recent EC2 issues, it
>> sounds like it would be better to start with:
>> -6 nodes split in two AZs 3/3
>> -Configure replication to do 2 in one AZ and one in the other
>> (NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this
>> happen naturally?)
>> -What does LOCAL_QUORUM do in this case?  Is there a "rack quorum"?  Or
>> does the natural latencies of AZs make LOCAL_QUORUM behave like a rack
>> quorum?
>> will
>> On Tue, Apr 26, 2011 at 1:14 AM, aaron morton <>wrote:
>>> For background see this article:
>>> <>And
>>> this recent discussion
>>> <>Issues
>>> that may be a concern:
>>> - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait
>>> cross AZ . Also consider it during maintenance tasks, how much of a pain is
>>> it going to be to have latency between every node.
>>> - IMHO not having sufficient (by that I mean 3) replicas in a cassandra
>>> DC to handle a single node failure when working at Quorum reduces the
>>> utility of the DC. e.g. with a local RF of 2 in the west, the quorum is 2,
>>> and if you lose one node from the replica set you will not be able to use
>>> local QUORUM for keys in that range. Or consider a failure mode where the
>>> west is disconnected from the east.
>>> Could you start simple with 3 replicas in one AZ in us-east and 3
>>> replicas in an AZ+Region ?  Then work through some failure scenarios.
>>> Hope that helps.
>>> Aaron
>>> On 22 Apr 2011, at 03:28, William Oberman wrote:
>>> Hi,
>>> My service is not yet ready to be fully multi-DC, due to how some of my
>>> legacy MySQL stuff works.  But, I wanted to get cassandra going ASAP and
>>> work towards multi-DC.  I have two main cassandra use cases: one where I can
>>> handle eventual consistency (and all of the writes/reads are currently ONE),
>>> and one where I can't (writes/reads are currently QUORUM).  My test cluster
>>> is currently 4 smalls all in us-east with RF=3 (more to prove I can
>>> clustering, than to have an exact production replica).  All of my unit
>>> tests, and "load tests" (again, not to prove true max load, but to more to
>>> tease out concurrency issues) are passing now.
>>> For production, I was thinking of doing:
>>> -4 cassandra larges in us-east (where I am now), once in each AZ
>>> -1 cassandra large in us-west (where I have nothing)
>>> For now, my data can fit into a single large's 2 disk ephemeral using
>>> RAID0, and I was then thinking of doing a RF=3 with us-east=2 and
>>> us-west=1.  If I do eventual consistency at ONE, and consistency at
>>> LOCAL_QUORUM, I was hoping:
>>> -eventual consistency ops would be really fast
>>> -consistent ops would be pretty fast (what does LOCAL_QUORUM do in this
>>> case?  return after 1 or 2 us-east nodes ack?)
>>> -us-west would contain a complete copy of my data, so it's a good
>>> eventually consistent "close to real time" backup  (assuming it can keep up
>>> over long periods of time, but I think it should)
>>> -eventually, when I'm ready to roll out in us-west I'll be able to change
>>> the replication settings and that server in us-west could help seed new
>>> cassandra instances faster than the ones in us-east
>>> Or am I missing something really fundamental about how cassandra works
>>> making this a terrible plan?  I should have plenty of time to get my
>>> multi-DC working before the instance in us-west fills up (but even then, I
>>> should be able to add instances over there to stall fairly trivially,
>>> right?).
>>> Thanks!
>>> will
>> --
>> Will Oberman
>> Civic Science, Inc.
>> 3030 Penn Avenue., First Floor
>> Pittsburgh, PA 15201
>> (M) 412-480-7835
>> (E)
> --
> Will Oberman
> Civic Science, Inc.
> 3030 Penn Avenue., First Floor
> Pittsburgh, PA 15201
> (M) 412-480-7835
> (E)

Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835

View raw message