incubator-cassandra-user mailing list archives

From Aaron Turner <synfina...@gmail.com>
Subject Re: Backup solution
Date Fri, 15 Mar 2013 23:25:56 GMT
On Fri, Mar 15, 2013 at 10:35 AM, Rene Kochen
<rene.kochen@emea.schange.com> wrote:
> Hi Aaron,
>
> We have many deployments, but typically:
>
> - Live cluster of six nodes, replication factor = 3.
> - A node processes more reads than writes (approximately 100 get_slices
> per/second, narrow rows).
> - Data per node is about 50 to 100 GBytes.
> - We should recover within 4 hours.
>
> The idea is to put the backup cluster close to the live cluster with a
> gigabit connection only for Cassandra.

100 reads/sec/node doesn't sound like a lot to me... And 100G/node is
far below the recommended limit.  Sounds to me like you've possibly
over-spec'd your cluster (not a bad thing, just an observation).  Of
course, if your data set is growing, then...

That said, I wouldn't consider a single node in a 2nd DC receiving
updates via Cassandra a "backup".  That's because a bug in Cassandra
which corrupts your data, or a user accidentally doing the wrong thing
(like issuing deletes they shouldn't), gets replicated to all your
nodes, including the one in the other DC.

A real backup would be to take snapshots on the nodes and then copy
them off the cluster.
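
A minimal sketch of that snapshot-then-copy step, in Python. It assumes
`nodetool` is on the PATH, that snapshots land under
`<data_dir>/<keyspace>/<table>/snapshots/<tag>` (check the layout for
your version), and that `dest_root` is storage outside the cluster,
such as a mounted backup volume.  All function names and paths here
are hypothetical.

```python
import glob
import os
import shutil
import subprocess


def snapshot_command(tag, keyspace):
    """Build the nodetool call that snapshots one keyspace under a tag."""
    return ["nodetool", "snapshot", "-t", tag, keyspace]


def find_snapshot_dirs(data_dir, keyspace, tag):
    """List snapshot directories for the tag.

    Assumes the <data_dir>/<keyspace>/<table>/snapshots/<tag> layout.
    """
    pattern = os.path.join(data_dir, keyspace, "*", "snapshots", tag)
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def copy_snapshots(data_dir, keyspace, tag, dest_root):
    """Copy each snapshot directory under dest_root, keyed by table name."""
    copied = []
    for src in find_snapshot_dirs(data_dir, keyspace, tag):
        # src ends in <table>/snapshots/<tag>; go up two levels for the table.
        table = os.path.basename(os.path.dirname(os.path.dirname(src)))
        dest = os.path.join(dest_root, keyspace, table, tag)
        shutil.copytree(src, dest)  # fails if dest already exists
        copied.append(dest)
    return copied


def backup_keyspace(data_dir, keyspace, tag, dest_root):
    """Take the snapshot on this node, then copy it off the node."""
    subprocess.run(snapshot_command(tag, keyspace), check=True)
    return copy_snapshots(data_dir, keyspace, tag, dest_root)
```

This would run on every node, since each node only snapshots its own
data.  Old snapshots also need to be cleared on the nodes, or they
will eat disk space.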

I'd say replication is good if you want a hot standby for a disaster
recovery site so you can quickly recover from a hardware fault.
Especially with a 4hr SLA: how are you going to get your primary DC
back up after a fire, earthquake, etc. in 4 hours?  Heck, a switch
failure might knock you out for 4 hours depending on how quickly you
can swap another one in and how recent your config backups are.
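
As a rough sanity check on the 4-hour window, the arithmetic below
estimates how long it takes just to move data over the dedicated
gigabit link mentioned in the quoted message.  It is only a sketch:
the function name and the `efficiency` parameter are hypothetical, it
uses decimal GB, and it ignores streaming, compaction and repair work
on the receiving side, so a real recovery will take longer.

```python
def transfer_seconds(data_gb, link_gbps, efficiency=1.0):
    """Seconds to move data_gb gigabytes over a link_gbps gigabit/s link.

    efficiency is the fraction of line rate actually achieved (0 < e <= 1).
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    gigabits = data_gb * 8  # 1 byte = 8 bits
    return gigabits / (link_gbps * efficiency)


# One node at the upper end of the quoted range, at full line rate.
one_node = transfer_seconds(100, 1.0)      # 800.0 seconds, about 13 minutes
# All six nodes (6 x 100 GB) pushed through the same single link.
six_nodes = transfer_seconds(600, 1.0)     # 4800.0 seconds, 80 minutes
print(one_node / 60, six_nodes / 60)
```

Even at theoretical line rate, pulling everything back through one
link takes a good chunk of the 4 hours before any rebuild work starts.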

Better to have a DR site with a smaller set of nodes with the data
ready to go.  Maybe they won't be as fast, but hopefully you can make
sure the most important queries are handled.  But for that, I would
probably go with something more than just a single node in the DR DC.

One thing to remember is that compactions will limit the feasible
single node size to something smaller than the disk space you can
potentially allocate.  I.e., just because you can build a 4TB disk
array doesn't mean you can have a single Cassandra node with 4TB of
data.  Typically, people around here seem to recommend ~400GB, but
that depends on hardware.
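
To put a number on the disk-space side of that, here is a small
sketch.  It assumes a compaction can temporarily need free space
comparable to the data it rewrites, so some fraction of the disk has
to stay empty.  The 0.5 default is an illustrative worst-case
assumption, not a figure from this thread, and the function name is
hypothetical.  It only covers disk headroom, not the other hardware
factors behind the ~400GB rule of thumb.

```python
def max_data_per_node_gb(disk_gb, free_fraction=0.5):
    """Data a node can hold if free_fraction of the disk stays empty.

    free_fraction is the share of the disk reserved as compaction
    headroom (an assumed value; tune it for your compaction strategy).
    """
    if not 0 <= free_fraction < 1:
        raise ValueError("free_fraction must be in [0, 1)")
    return disk_gb * (1 - free_fraction)


# A 4TB array with half the disk reserved for compaction.
print(max_data_per_node_gb(4000))  # 2000.0
```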

Honestly, for the price of a single computer you could test this
pretty easily.  That's what I'd do.

-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"
