cassandra-user mailing list archives

From Eran Kutner <>
Subject Re: Questions while evaluating Cassandra
Date Thu, 04 Mar 2010 08:51:49 GMT
Thanks Jonathan,
A few more clarification questions below.


On Tue, Mar 2, 2010 at 15:44, Jonathan Ellis <> wrote:
> On Tue, Mar 2, 2010 at 6:43 AM, Eran Kutner <> wrote:
> > Is the procedure described in the description of ticket CASSANDRA-44 really
> > the way to do schema changes in the latest release? I'm not sure what your
> > thoughts are about this, but our experience is that every release of our software
> > requires schema changes because we add new column families for indexes.
> Yes, that is how it is for 0.5 and 0.6.  0.7 will add online schema
> changes (i.e., fix -44), Gary is working on that now.

So just to be clear, that would require a complete cluster restart as
well as stopping the client app (to prevent new writes from coming in
after doing the flush), right? Do you know how others are handling it
on a live system?

> > Any idea on the timeframe for 0.7?
> We are trying for 3-4 months, i.e. roughly the same as our last 4 releases.
> > Our application needs a lot of range scans. Is there anything being done to
> > improve the poor range scan performance as reflected here:
> > ?
> is open, also for
> the 0.7 release.  Johan is working on this.
> > What is the reason for the replication strategy with two DCs? As far as I
> > understand it means that only one replica will exist in the second DC. It
> > also means that quorum reads will fail when attempted on the second DC while
> > the first DC is down. Am I missing something?
> Yes:
>  - That strategy is meant for doing reads w/ CL.ONE; it guarantees at
> least one replica in each DC, for low latency with that CL
>  -  Quorum is based on the whole cluster, not per-DC.
> DatacenterShardStrategy will put multiple replicas in each DC, for use
> with CL.DCQUORUM, that is, a majority of replicas in the same DC as
> the coordinator node for the current request.  DCQUORUM is not yet
> finished, though; currently it behaves the same as CL.ALL.

Is it planned for any specific release?
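
Just to make sure I follow the quorum arithmetic, here is how I understand
it (a rough sketch; the replica placement of 2 in DC1 and 1 in DC2 is just
my own example for RF=3 with the rack-aware strategy, not anything taken
from the code):

public class QuorumSketch {
    public static void main(String[] args) {
        int replicationFactor = 3;
        int replicasInDc1 = 2; // assumed placement, my own example
        int replicasInDc2 = 1; // assumed placement, my own example

        // Quorum is computed over the whole cluster, not per DC.
        int quorum = replicationFactor / 2 + 1; // = 2

        // If DC1 is down, only the single DC2 replica is reachable, so a
        // CL.QUORUM read issued in DC2 cannot reach 2 live replicas.
        int liveReplicas = replicasInDc2;
        System.out.println("replicas: DC1=" + replicasInDc1 + ", DC2=" + replicasInDc2);
        System.out.println("quorum needed: " + quorum);
        System.out.println("live replicas with DC1 down: " + liveReplicas);
        System.out.println("quorum read succeeds: " + (liveReplicas >= quorum));
    }
}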

> > Are there any plans to have a inter-cluster replication option? I mean
> > having two clusters running in two DCs, each will be stand alone but they
> > will replicate data between themselves.
> No.  This is worse in every respect, since it means you get to
> reinvent the existing repair, hinted handoff, etc. code for when
> replication breaks, poorly.

I'm not sure I understand why you would need to redo all of that. As a
trivial design, assume that every write in DC1 is also logged to a system
table, which would be just a standard Cassandra table; writes are cheap
anyway, so doing an extra write per write is a reasonable tradeoff. Then a
background service would read that system table with CL.ALL and write the
data to DC2, again with CL.ALL. With a single server/thread doing the
replication it's almost trivial, and even with more servers/threads I
think it can still be managed with very small changes to the existing
Cassandra system.
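
Something like this is what I have in mind (a very rough sketch; the
KeyspaceClient interface and WriteLogEntry class below are made-up
stand-ins for whatever client API would actually be used, not the real
Thrift interface):

import java.util.List;

// Made-up stand-in for a real client API; not the actual Thrift interface.
interface KeyspaceClient {
    List<WriteLogEntry> readWriteLogSince(long sequence); // read at CL.ALL
    void apply(WriteLogEntry entry);                       // write at CL.ALL
}

// Hypothetical record of one mutation that was also written to the
// "write log" system column family in DC1.
class WriteLogEntry {
    long sequence;
    byte[] key;
    byte[] columnName;
    byte[] value;
}

class Dc1ToDc2Replicator implements Runnable {
    private final KeyspaceClient dc1;
    private final KeyspaceClient dc2;
    private long lastReplicated = 0;

    Dc1ToDc2Replicator(KeyspaceClient dc1, KeyspaceClient dc2) {
        this.dc1 = dc1;
        this.dc2 = dc2;
    }

    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Read everything logged in DC1 since the last replicated entry
            // and re-apply it in DC2, in order.
            for (WriteLogEntry entry : dc1.readWriteLogSince(lastReplicated)) {
                dc2.apply(entry);
                lastReplicated = entry.sequence;
            }
            try {
                Thread.sleep(1000); // simple polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}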

> > This can avoid the problem mentioned
> > above, as well as avoid the high cost of inter-DC traffic when doing
> > Read-Repairs for every read.
> Of course if you don't RR then you can read inconsistent data until
> your next full repair.   Not a good trade.  Remember RR is done in the
> background so the latency doesn't matter.

I am more concerned about the actual cost of the bandwidth. On a typical
application with 80-90% reads, doing RR means you need a very wide link
between the DCs. It will probably get even worse once CL.DCQUORUM is
available, because then more data will have to be read from the remote
DC.
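
Back-of-the-envelope (all of the numbers here are my own assumptions, not
measurements, and I'm ignoring that RR may send digests rather than full
rows most of the time):

public class ReadRepairBandwidthSketch {
    public static void main(String[] args) {
        double requestsPerSecond = 10000;  // assumed total request rate
        double readFraction = 0.85;        // "80-90% reads"
        double avgRowSizeBytes = 2 * 1024; // assumed average row size

        // Worst case: every read pulls the full row from the remote DC for
        // read repair (in practice a digest may cross the link instead).
        double readsPerSecond = requestsPerSecond * readFraction;
        double crossDcBytesPerSecond = readsPerSecond * avgRowSizeBytes;

        System.out.printf("~%.1f MB/s of inter-DC traffic just for RR%n",
                crossDcBytesPerSecond / (1024 * 1024));
    }
}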

> > From everything I've read I didn't understand if load balancing is local or
> > global. In other words, what happens exactly when a new node is added? Will
> > it only balance its two neighbors on the ring or will the re-balance
> > propagate through the ring and all the nodes will be rebalanced evenly?
> The former.  Cascading data moves around the ring is a Bad Idea.
> (Since you read the Yahoo hbase/cassandra paper -- if hbase does this,
> maybe that is why adding a new node basically kills their cluster for
> several minutes?)

I don't know exactly what they are doing there, but in general, since the
data layer (HDFS) is separate from the DB layer (HBase), they should be
able to reassign key ranges to other region servers quite easily. I can
only assume the slowdown happens because a region server has to flush all
of its memory tables to disk before it can split its ranges. Re-balancing
HDFS is definitely not done automatically; they have a "balancer" service
that has to be run manually to redistribute HDFS blocks after
adding/removing nodes, but it does its work slowly in the background.
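
Coming back to the Cassandra side, to make sure I follow what adding a
node means for the ring (a rough sketch of my mental model, not the
actual bootstrap code):

import java.util.SortedSet;
import java.util.TreeSet;

public class RingJoinSketch {
    public static void main(String[] args) {
        SortedSet<Long> tokens = new TreeSet<Long>();
        tokens.add(0L);
        tokens.add(100L);
        tokens.add(200L); // existing nodes at tokens 0, 100, 200

        // A new node picks (or is assigned) token 150, so only the node at
        // 200, which previously owned (100, 200], hands over (100, 150].
        // The nodes at 0 and 100 keep exactly the ranges they had before;
        // nothing cascades around the ring.
        tokens.add(150L);

        System.out.println("ring after join: " + tokens);
    }
}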

> > I see that Hadoop support is coming in 0.6 but from following the ticket on
> > Jira (CASSANDRA-342) I didn't understand if it will support the
> > orderPreservingPartitioner or not.
> It supports all partitioners.
> > Do the clients have to be recompiled and deployed when a new version of
> > Cassandra is deployed, or are new releases backward compatible?
> The short answer is, we maintained backwards compatibility for 0.4 ->
> 0.5 -> 0.6, but we are going to break things in 0.7 moving from String
> keys to byte[] and possibly other changes.

Hmmm... My assumption was that although keys are strings, they are still
compared as bytes when using the OPP, right? That would be the difference
between the OPP and the COPP, right? Just confirming, because otherwise
creating composite keys with different data types may prove problematic.
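
Here is a generic illustration (not Cassandra code) of the distinction I'm
asking about, byte-wise/code-point ordering versus locale-aware collation:

import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class KeyOrderingSketch {
    public static void main(String[] args) {
        String[] keys = {"user:100", "User:2", "user:20"};

        // Byte-wise / code-point ordering: uppercase sorts before lowercase,
        // and "user:100" sorts before "user:20".
        String[] byteOrder = keys.clone();
        Arrays.sort(byteOrder);
        System.out.println("byte-wise: " + Arrays.toString(byteOrder));

        // Collation-based ordering: case is only a weak difference, so the
        // relative order of the keys changes.
        String[] collated = keys.clone();
        Arrays.sort(collated, Collator.getInstance(Locale.US));
        System.out.println("collated:  " + Arrays.toString(collated));
    }
}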

> -Jonathan
