cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adrian Cockcroft <>
Subject Re: Multi-DC Deployment
Date Wed, 20 Apr 2011 20:01:06 GMT
Hi Terje,

If you feed data to two rings, you will get inconsistency drift as an
update to one succeeds and to the other fails from time to time. You
would have to build your own read repair. This all starts to look like
"I don't trust Cassandra code to work, so I will write my own buggy
one off versions of Cassandra functionality". I lean towards using
Cassandra features rather than rolling my own because there is a large
community testing, fixing and extending Cassandra, and making sure
that the algorithms are robust. Distributed systems are very hard to
get right, I trust lots of users and eyeballs on the code more than
even the best engineer working alone.

Cassandra doesn't "replicate sstable corruptions". It detects corrupt
data and only replicates good data. Also data isn't replicated to
three identical nodes in the way you imply, it's replicated around the
ring. If you lose three nodes, you don't lose a whole node's worth of
data.  We configure each replica to be in a different availability
zone so that we can lose a third of our nodes (a whole zone) and still
work. On a 300 node system with RF=3 and no zones, losing one or two
nodes you still have all your data, and can repair the loss quickly.
With three nodes dead at once you don't lose 1% of the data (3/300) I
think you lose 1/(300*300*300) of the data (someone check my math?).

If you want to always get a result, then you use "read one", if you
want to get a highly available better quality result use local quorum.
That is a per-query option.


On Tue, Apr 19, 2011 at 6:46 PM, Terje Marthinussen
<> wrote:
> If you have RF=3 in both datacenters, it could be discussed if there is a
> point to use the built in replication in Cassandra at all vs. feeding the
> data to both datacenters and get 2 100% isolated cassandra instances that
> cannot replicate sstable corruptions between each others....
> My point is really a bit more general though.
> For a lot services (especially Internet based ones) 100% accuracy in terms
> of results is not needed (or maybe even expected)
> While you want to serve a 100% correct result if you can (using quorum), it
> is still much better to serve a partial result than no result at all.
> Lets say you have 300 nodes in your ring, one document manages to trigger a
> bug in cassandra that brings down a node with all its replicas (3 nodes
> down)
> For many use cases, it would be much better to return the remaining 99% of
> the data coming from the 297 working nodes than having a service which
> returns nothing at all.
> I would however like the frontend to realize that this is an incomplete
> result so it is possible for it to react accordingly as well as be part of
> monitoring of the cassandra ring.
> Regards,
> Terje
> On Tue, Apr 19, 2011 at 6:06 PM, Adrian Cockcroft
> <> wrote:
>> If you want to use local quorum for a distributed setup, it doesn't
>> make sense to have less than RF=3 local and remote. Three copies at
>> both ends will give you high availability. Only one copy of the data
>> is sent over the wide area link (with recent versions).
>> There is no need to use mirrored or RAID5 disk in each node in this
>> case, since you are using RAIN (N for nodes) to protect your data. So
>> the extra disk space to hold three copies at each end shouldn't be a
>> big deal. Netflix is using striped internal disks on EC2 nodes for
>> this.
>> Adrian
>> On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
>> <> wrote:
>> > Hum...
>> > Seems like it could be an idea in a case like this with a mode where
>> > result
>> > is always returned (if possible), but where a flay saying if the
>> > consistency
>> > level was met, or to what level it was met (number of nodes answering
>> > for
>> > instance).?
>> > Terje
>> >
>> > On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis <>
>> > wrote:
>> >>
>> >> They will timeout until failure detector realizes the DC1 nodes are
>> >> down (~10 seconds). After that they will immediately return
>> >> UnavailableException until DC1 comes back up.
>> >>
>> >> On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
>> >> <> wrote:
>> >> > We are planning to deploy Cassandra on two data centers.   Let us
>> >> > that
>> >> > we went with three replicas with 2 being in one data center and last
>> >> > replica
>> >> > in 2nd Data center.
>> >> >
>> >> > What will happen to Quorum Reads and Writes when DC1 goes down (2 of
>> >> > 3
>> >> > replicas are unreachable)?  Will they timeout?
>> >> >
>> >> >
>> >> > Regards,
>> >> > Baskar
>> >>
>> >>
>> >>
>> >> --
>> >> Jonathan Ellis
>> >> Project Chair, Apache Cassandra
>> >> co-founder of DataStax, the source for professional Cassandra support
>> >>
>> >
>> >

View raw message