cassandra-user mailing list archives

From Benjamin Black <b...@b3k.us>
Subject Re: Integrity of batch_insert and also what about sharding?
Date Thu, 08 Apr 2010 02:04:59 GMT
What benefit does a SAN give you?  I've generally been confused by
that approach, so I'm assuming I am missing something.

On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander
<Jason.Alexander@match.com> wrote:
> FWIW, I'd love to see some guidance here too -
>
> From our standpoint, we'll be consolidating the various Match.com sites'
> (match.com, chemistry.com, etc.) data into a single data warehouse, running
> Cassandra. We're looking at roughly the same amounts of data (30 TB or more).
> We were assuming 3-5 big servers sitting atop a SAN. But, again, that's just
> a guess, following the existing conventions we use for other systems.
>
>
> ________________________________________
> From: banks [banksenus@gmail.com]
> Sent: Wednesday, April 07, 2010 8:47 PM
> To: user@cassandra.apache.org
> Subject: Re: Integrity of batch_insert and also what about sharding?
>
> What I'm trying to wrap my head around is the break-even point...
>
> If I'm going to store 30 terabytes in this thing... what's optimal for
> performance and scalability? Is it best to be running 3 powerful nodes, 100
> smaller nodes, or a node on each web blade with 300 GB behind each... ya
> know? I'm sure there is a point where the gossip chatter becomes
> overwhelming, and there are ups and downs to each. I have not really seen a
> best-practices document that gives the pros and cons of each method of scaling.
>
> One 64-processor, 90 GB memory mega-machine running a single Cassandra node,
> but on a RAID 5 SAN: good? bad? why?
>
> 30 web blades, each running a Cassandra node with 1 TB of local RAID 5
> storage: good, bad, why?
>
> I get that every implementation is different. What I'm looking for is the
> known, proven optimum for this software... and what's to be avoided because
> it's a given that it doesn't work.
>
> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black <b@b3k.us> wrote:
> That depends on your goals for fault tolerance and recovery time.  If
> you use RAID 1 (or another redundant configuration), you can tolerate a
> disk failure without Cassandra having to run repair.  For large data
> sets, that can be a significant win.
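>
> Rough math, using the 3 TB / RF=3 numbers from earlier in the thread (a
> sketch only; commitlog, compaction headroom, and snapshots add overhead
> on top of this):
>
>   unique data:          3 TB
>   with RF=3:            3 TB x 3 = 9 TB of replicas cluster-wide
>   on RAID 0 (stripe):   9 TB raw disk
>   on RAID 1 (mirror):   9 TB x 2 = 18 TB raw disk
>
> RAID 1 doubles the raw disk, but a dead disk becomes a hot swap instead
> of a node rebuild plus repair traffic across the cluster.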
>
>
> b
>
> On Wed, Apr 7, 2010 at 6:02 PM, banks <banksenus@gmail.com> wrote:
>> Then from an IT standpoint, if I'm using an RF of 3, it stands to reason that
>> running on RAID 0 makes sense, since RAID mirroring and RF achieve the same ends... it
>> makes sense to stripe for speed and let Cassandra deal with redundancy, eh?
>>
>>
>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black <b@b3k.us> wrote:
>>>
>>> On Wed, Apr 7, 2010 at 3:41 PM, banks <banksenus@gmail.com> wrote:
>>> >
>>> > 2. each cassandra node essentially has the same datastore as all nodes,
>>> > correct?
>>>
>>> No.  The ReplicationFactor you set determines how many copies of a
>>> piece of data you want.  If your number of nodes is higher than your
>>> RF, as is common, you will not have the same data on all nodes.  The
>>> exact set of nodes to which data is replicated is determined by the
>>> row key, placement strategy, and node tokens.
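>>>
>>> For example, in a 0.6-era storage-conf.xml (the keyspace name and
>>> strategy below are just illustrative, a minimal sketch rather than a
>>> recommendation):
>>>
>>>   <Keyspace Name="Keyspace1">
>>>     <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
>>>     <ReplicationFactor>3</ReplicationFactor>
>>>     <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>>>   </Keyspace>
>>>
>>> With, say, 10 nodes and RF=3, any given row lives on exactly 3 of the 10.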
>>>
>>> > So if I've got 3 terabytes of data and 3 Cassandra nodes, am I
>>> > eating 9 TB on the SAN? Are there provisions for essentially sharding
>>> > across nodes, so that each node only handles a given key range? If so,
>>> > where is the howto on that?
>>> >
>>>
>>> Sharding is a concept from databases that don't have native
>>> replication and so need a term to describe what they bolt on for the
>>> functionality.  Distribution amongst nodes based on key ranges is how
>>> Cassandra always operates.
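>>>
>>> The part you do control is where each node sits in the ring, via its
>>> InitialToken. A common trick for a balanced cluster is evenly spaced
>>> tokens; a quick sketch for the RandomPartitioner (token space
>>> 0..2**127), with the node count as a made-up parameter:
>>>
>>>   # evenly spaced tokens across RandomPartitioner's 0..2**127 space
>>>   def initial_tokens(node_count):
>>>       return [i * (2 ** 127 // node_count) for i in range(node_count)]
>>>
>>>   for i, t in enumerate(initial_tokens(3)):
>>>       print("node %d: InitialToken = %d" % (i, t))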
>>>
>>>
>>> b
>>
>>
>
>
