cassandra-user mailing list archives

From Jason Alexander <>
Subject RE: Integrity of batch_insert and also what about sharding?
Date Thu, 08 Apr 2010 02:28:43 GMT
Well, IANAITG (I Am Not An IT Guy), but beyond the normal benefits you get from a SAN (which
you can, of course, get from other options), I believe our IT group likes it for the
management aspects - they like to buy a BigAssSAN(tm) and provision storage to different clusters,
environments, etc... I'm sure it's also heavily weighted by the fact that it's "the devil
we know".

-----Original Message-----
From: Benjamin Black [] 
Sent: Wednesday, April 07, 2010 9:05 PM
Subject: Re: Integrity of batch_insert and also what about sharding?

What benefit does a SAN give you?  I've generally been confused by
that approach, so I'm assuming I am missing something.

On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander
<> wrote:
> FWIW, I'd love to see some guidance here too -
> From our standpoint, we'll be consolidating the various sites' (,, etc...) data into a single data warehouse, running Cassandra. We're looking
> at roughly the same amounts of data (30 TB or more). We were assuming 3-5 big servers sitting
> atop a SAN. But, again, that's just a guess, following existing conventions we use for other systems.
> ________________________________________
> From: banks []
> Sent: Wednesday, April 07, 2010 8:47 PM
> To:
> Subject: Re: Integrity of batch_insert and also what about sharding?
> What I'm trying to wrap my head around is what is the break-even point...
> If I'm going to store 30 terabytes in this thing... what's optimum to give me performance
> and scalability... is it best to be running 3 powerful nodes, 100 smaller nodes, nodes on
> each web blade with 300g behind each...  ya know?  I'm sure there is a point where the gossip
> chatter becomes overwhelming, and ups and downs to each... I have not really seen a best practices
> document that gives the pros and cons of each method of scaling.
> one 64proc 90gig memory mega machine running a single node cassandra... but on a raid5
> SAN, good? bad?  why?
> 30 web blades each running a cassandra node, each with 1tb local raid5 storage, good,
> bad, why?
> I get that every implementation is different, what I'm looking for is what the known
> proven optimum is for this software... and what's to be avoided because it's a given that it
> doesn't work.
> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black <<>>
> That depends on your goals for fault tolerance and recovery time.  If
> you use RAID1 (or other redundant configuration) you can tolerate disk
> failure without Cassandra having to do repair.  For large data sets,
> that can be a significant win.
> b
> On Wed, Apr 7, 2010 at 6:02 PM, banks <<>>
>> Then from an IT standpoint, if I'm using a RF of 3, it stands to reason that
>> running on Raid 1 makes sense, since RAID and RF achieve the same ends... it
>> makes sense to stripe for speed and let cassandra deal with redundancy, eh?
>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black <<>>
>>> On Wed, Apr 7, 2010 at 3:41 PM, banks <<>>
>>> >
>>> > 2. each cassandra node essentially has the same datastore as all nodes,
>>> > correct?
>>> No.  The ReplicationFactor you set determines how many copies of a
>>> piece of data you want.  If your number of nodes is higher than your
>>> RF, as is common, you will not have the same data on all nodes.  The
>>> exact set of nodes to which data is replicated is determined by the
>>> row key, placement strategy, and node tokens.
>>> > So if I've got 3 terabytes of data and 3 cassandra nodes I'm
>>> > eating 9tb on the SAN?  are there provisions for essentially sharding
>>> > across nodes... so that each node only handles a given keyrange, if so
>>> > where is the howto on that?
>>> >
>>> Sharding is a concept from databases that don't have native
>>> replication and so need a term to describe what they bolt on for the
>>> functionality.  Distribution amongst nodes based on key ranges is how
>>> Cassandra always operates.
>>> b
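
As a rough illustration of the two answers above (the replication factor decides how many copies exist, and key ranges decide which nodes hold them), here is a minimal Python sketch. It is not Cassandra's code. It assumes MD5-derived tokens on a 2^127 ring (loosely modeled on RandomPartitioner), evenly spaced node tokens, and a simple rack-unaware placement that walks the ring clockwise. All function names (token_for, evenly_spaced_tokens, replicas_for, raw_storage_per_node_tb) are hypothetical.

```python
import bisect
import hashlib

# Assumed token space; loosely modeled on an MD5-based partitioner.
RING_SIZE = 2 ** 127


def token_for(key: str) -> int:
    """Map a row key to a position on the ring (assumption: MD5 of the key)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def evenly_spaced_tokens(node_names):
    """Give each node one token, spaced evenly around the ring."""
    n = len(node_names)
    return sorted((i * RING_SIZE // n, name) for i, name in enumerate(node_names))


def replicas_for(key, ring, rf):
    """Return the nodes holding `key`: the owner of its range plus the
    next rf - 1 nodes clockwise (a simple, rack-unaware placement)."""
    tokens = [t for t, _ in ring]
    # The owner is the first node whose token is >= the key's token,
    # wrapping around to the first node at the end of the ring.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    count = min(rf, len(ring))  # cannot place more copies than there are nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def raw_storage_per_node_tb(data_tb, rf, node_count):
    """Average raw storage per node, assuming an evenly balanced ring."""
    return data_tb * min(rf, node_count) / node_count


if __name__ == "__main__":
    ring = evenly_spaced_tokens(["node%d" % i for i in range(1, 7)])
    print(replicas_for("user:42", ring, rf=3))
    print(raw_storage_per_node_tb(3, rf=3, node_count=3))
    print(raw_storage_per_node_tb(30, rf=3, node_count=30))
```

Under these assumptions, 3 nodes at RF 3 each hold every row, so 3 TB of data does occupy 9 TB in total. With 30 nodes at RF 3, each key lands on only 3 of them, and 30 TB of data averages about 3 TB per node. The arithmetic ignores compaction headroom and any RAID overhead underneath.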
