cassandra-user mailing list archives

From Cliff Moon <cl...@moonpolysoft.com>
Subject Re: Integrity of batch_insert and also what about sharding?
Date Thu, 08 Apr 2010 04:38:24 GMT
Putting Cassandra's data directories on a SAN is like putting a bunch of 
F1s on one of those big car-carrier trucks and entering a race with the 
truck.  You know, since you have so much horsepower.

On 4/7/10 7:28 PM, Jason Alexander wrote:
> Well, IANAITG (I Am Not An IT Guy), but beyond the normal benefits you
> get from a SAN (which you can, of course, get from other options), I
> believe our IT group likes it for the management aspects - they like to
> buy a BigAssSAN(tm) and provision storage to different clusters,
> environments, etc... I'm sure it's also heavily weighted by the fact
> that it's "the devil we know".
>
> -----Original Message-----
> From: Benjamin Black [mailto:b@b3k.us]
> Sent: Wednesday, April 07, 2010 9:05 PM
> To: user@cassandra.apache.org
> Subject: Re: Integrity of batch_insert and also what about sharding?
>
> What benefit does a SAN give you?  I've generally been confused by
> that approach, so I'm assuming I am missing something.
>
> On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander
> <Jason.Alexander@match.com>  wrote:
>    
>> FWIW, I'd love to see some guidance here too -
>>
>> From our standpoint, we'll be consolidating the various Match.com
>> sites' (match.com, chemistry.com, etc...) data into a single data
>> warehouse, running Cassandra. We're looking at roughly the same amounts
>> of data (30TB or more). We were assuming 3-5 big servers sitting atop a
>> SAN. But, again, just a guess, following existing conventions we use
>> for other systems.
>>
>>
>> ________________________________________
>> From: banks [banksenus@gmail.com]
>> Sent: Wednesday, April 07, 2010 8:47 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Integrity of batch_insert and also what about sharding?
>>
>> What I'm trying to wrap my head around is where the break-even point is...
>>
>> If I'm going to store 30 terabytes in this thing... what's optimum to
>> give me performance and scalability... is it best to be running 3
>> powerful nodes, 100 smaller nodes, or nodes on each web blade with 300G
>> behind each... ya know?  I'm sure there is a point where the gossip
>> chatter becomes overwhelming, and there are ups and downs to each... I
>> have not really seen a best-practices document that gives the pros and
>> cons of each method of scaling.
>>
>> One 64-proc, 90-gig-memory mega machine running a single-node
>> Cassandra... but on a RAID 5 SAN - good? bad?  why?
>>
>> 30 web blades each running a Cassandra node, each with 1TB of local
>> RAID 5 storage - good, bad, why?
>>
>> I get that every implementation is different; what I'm looking for is
>> the known, proven optimum for this software... and what's to be avoided
>> because it's a given that it doesn't work.
>>
>> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black <b@b3k.us> wrote:
>> That depends on your goals for fault tolerance and recovery time.  If
>> you use RAID1 (or other redundant configuration) you can tolerate disk
>> failure without Cassandra having to do repair.  For large data sets,
>> that can be a significant win.
>>
>>
>> b
>>
>> On Wed, Apr 7, 2010 at 6:02 PM, banks <banksenus@gmail.com> wrote:
>>      
>>> Then from an IT standpoint, if I'm using an RF of 3, it stands to reason that
>>> running on RAID 1 makes sense, since RAID and RF achieve the same ends... it
>>> makes sense to stripe for speed and let Cassandra deal with redundancy, eh?
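The usable-capacity math behind the RAID-vs-RF tradeoff discussed in this thread can be sketched as follows. This is a back-of-envelope illustration only; the function name and numbers are invented for the example, and real-world filesystem and compaction overhead are ignored:

```python
# Rough capacity arithmetic for RAID level vs. replication factor.
# "usable fraction" is the share of raw spindles the RAID level leaves
# usable: ~0.5 for mirrored RAID 1, ~1.0 for striped RAID 0.
def cluster_disk_needed(raw_tb: float, rf: int, raid_usable_fraction: float) -> float:
    """Raw disk the cluster must buy to hold `raw_tb` of data at the
    given replication factor, on arrays with the given usable fraction."""
    return raw_tb * rf / raid_usable_fraction

# RF=3 on mirrored RAID 1: redundancy is paid for twice, 6x the raw data.
print(cluster_disk_needed(3, 3, 0.5))   # 18.0 TB of raw disk for 3 TB of data
# RF=3 on striped RAID 0: Cassandra alone provides redundancy, 3x the raw data.
print(cluster_disk_needed(3, 3, 1.0))   # 9.0 TB
```

This is the arithmetic behind "let Cassandra deal with redundancy": stacking RAID 1 under RF=3 doubles the disk bill for redundancy the cluster already provides, at the cost (as noted later in the thread) of Cassandra having to repair after any disk failure.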
>>>
>>>
>>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black <b@b3k.us> wrote:
>>>        
>>>> On Wed, Apr 7, 2010 at 3:41 PM, banks <banksenus@gmail.com> wrote:
>>>>          
>>>>> 2. Each Cassandra node essentially has the same datastore as all
>>>>> nodes, correct?
>>>>>            
>>>> No.  The ReplicationFactor you set determines how many copies of a
>>>> piece of data you want.  If your number of nodes is higher than your
>>>> RF, as is common, you will not have the same data on all nodes.  The
>>>> exact set of nodes to which data is replicated is determined by the
>>>> row key, placement strategy, and node tokens.
>>>>
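The placement rule described above (row key, placement strategy, and node tokens together pick the replica set) can be sketched roughly like this. This is an illustrative toy, not Cassandra's actual partitioner or strategy code; all node names, the ring size, and the hash choice are made up for the example:

```python
# Toy sketch of token-ring replica placement: hash the row key onto a
# ring, then walk clockwise from that token taking the next RF distinct
# nodes (roughly what a simple placement strategy does).
import hashlib

def token(key: str) -> int:
    # Stand-in for a partitioner: map the key onto a 2**32 ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**32

def replicas(key: str, node_tokens: dict, rf: int) -> list:
    """Return the rf nodes responsible for `key`, given each node's token."""
    ring = sorted(node_tokens.items(), key=lambda kv: kv[1])  # (node, token)
    t = token(key)
    # First node whose token is >= the key's token; wrap around if none.
    start = next((i for i, (_, nt) in enumerate(ring) if nt >= t), 0)
    return [ring[(start + i) % len(ring)][0] for i in range(rf)]

nodes = {"node-a": 0, "node-b": 2**30, "node-c": 2**31, "node-d": 3 * 2**30}
print(replicas("user:42", nodes, rf=3))  # 3 of the 4 nodes, never all 4
```

With 4 nodes and RF=3, any given key lands on exactly 3 of them, which is the point being made: unless RF equals the node count, no node holds everything.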
>>>>          
>>>>> So if I've got 3 terabytes of data and 3 Cassandra nodes, am I
>>>>> eating 9TB on the SAN?  Are there provisions for essentially
>>>>> sharding across nodes... so that each node only handles a given
>>>>> key range?  If so, where is the howto on that?
>>>>>
>>>>>            
>>>> Sharding is a concept from databases that don't have native
>>>> replication and so need a term to describe what they bolt on for the
>>>> functionality.  Distribution amongst nodes based on key ranges is how
>>>> Cassandra always operates.
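The storage arithmetic behind the 3 TB / RF=3 question works out as follows. A back-of-envelope sketch only; the helper name is invented for the example, and it ignores compaction and commit-log overhead:

```python
# Total disk consumed is raw data size times RF, and because keys are
# distributed across the ring, each node holds roughly an equal share.
def tb_per_node(raw_tb: float, rf: int, num_nodes: int) -> float:
    return raw_tb * rf / num_nodes

print(3 * 3)                   # 9 TB cluster-wide for 3 TB raw at RF=3
print(tb_per_node(3, 3, 3))    # 3.0 TB on each of 3 nodes
print(tb_per_node(30, 3, 30))  # 3.0 TB on each of 30 blades
```

So yes, 3 TB at RF=3 consumes 9 TB cluster-wide regardless of node count, but adding nodes shrinks each node's share, which is what the key-range distribution buys you.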
>>>>
>>>>
>>>> b
>>>>          
>>>
>>>        
>>
>>      

