Subject: Re: Integrity of batch_insert and also what about sharding?
From: David Timothy Strauss
To: user@cassandra.apache.org
Date: Thu, 8 Apr 2010 00:12:38 -0500 (CDT)
In practice, gossip chatter is quite manageable well beyond 100 nodes.

One advantage of many small nodes is that the cost of node failure is small on rebuild. If you have 100 nodes with a hundred gigs each, the price you pay for a node's complete failure is pulling a hundred gigs from peers. If you only have 10 huge nodes with the same data quantity, you're paying 10x the cost in rebuild traffic and affecting a proportionally greater part of your infrastructure for the rebuild.
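
To put rough numbers on that, here is a quick back-of-the-envelope Python sketch, assuming data is spread evenly across the cluster; the function name and figures are just illustrative, not anything from Cassandra itself:

# Data that peers must stream to rebuild one completely failed node,
# assuming an even data distribution across the cluster.
def rebuild_traffic_gb(total_data_gb, node_count):
    return total_data_gb / node_count

# 10 TB total on 100 nodes: one failure costs ~100 GB of peer traffic.
print(rebuild_traffic_gb(10000, 100))  # 100.0
# The same 10 TB on 10 big nodes: ~1,000 GB per failure, 10x the cost.
print(rebuild_traffic_gb(10000, 10))   # 1000.0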

Keep going horizontal until you hit the highest reasonable memory/disk ratio, and don't use a SAN.

----- "banks" <banksenus@gmail.com> wrote:
What I'm trying to wrap my head around is where the break-even point is...

If I'm going to store 30 terabytes in this thing... what's optimal for performance and scalability? Is it best to run 3 powerful nodes, 100 smaller nodes, or a node on each web blade with 300 GB behind it... ya know? I'm sure there is a point where the gossip chatter becomes overwhelming, and there are ups and downs to each approach... I have not really seen a best-practices document that lays out the pros and cons of each way of scaling.

One 64-processor, 90 GB memory mega-machine running a single Cassandra node... but on a RAID 5 SAN: good? bad? why?

30 web blades each running a Cassandra node, each with 1 TB of local RAID 5 storage: good? bad? why?

I get that every implementation is different; what I'm looking for is the known, proven optimum for this software... and what's to be avoided because it's a given that it doesn't work.

On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black <b@b3k.us> wrote:
That depends on your goals for fault tolerance and recovery time.  If
you use RAID1 (or other redundant configuration) you can tolerate disk
failure without Cassandra having to do repair.  For large data sets,
that can be a significant win.


b

On Wed, Apr 7, 2010 at 6:02 PM, banks <banksenus@gmail.com> wrote:
> Then from an IT standpoint, if I'm using an RF of 3, it stands to reason that
> running on RAID 1 makes sense, since RAID and RF achieve the same ends... it
> makes sense to stripe for speed and let Cassandra deal with redundancy, eh?
>
>
> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black <b@b3k.us> wrote:
>>
>> On Wed, Apr 7, 2010 at 3:41 PM, banks <banksenus@gmail.com> wrote:
>> >
>> > 2. each cassandra node essentially has the same datastore as all nodes,
>> > correct?
>>
>> No.  The ReplicationFactor you set determines how many copies of a
>> piece of data you want.  If your number of nodes is higher than your
>> RF, as is common, you will not have the same data on all nodes.  The
>> exact set of nodes to which data is replicated is determined by the
>> row key, placement strategy, and node tokens.
>>
>> > So if I've got 3 terabytes of data and 3 Cassandra nodes, I'm
>> > eating 9 TB on the SAN? Are there provisions for essentially
>> > sharding across nodes, so that each node only handles a given key
>> > range? If so, where is the how-to on that?
>> >
>>
>> Sharding is a concept from databases that don't have native
>> replication and so need a term to describe what they bolt on for the
>> functionality.  Distribution amongst nodes based on key ranges is how
>> Cassandra always operates.
>>
>>
>> b
>
>
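
To make Ben's point above about replication factor and token-based placement concrete, here is a toy Python sketch of ring-based replica selection. The node names, tokens, and hash function are made up for illustration; this is not Cassandra's actual code, but it shows why, when you have more nodes than your RF, no single node holds all the data:

import hashlib

# A toy 4-node ring: (token, node) pairs, sorted by token.
RING = sorted([
    (0, "node-A"),
    (2**127 // 4, "node-B"),
    (2**127 // 2, "node-C"),
    (3 * (2**127 // 4), "node-D"),
])

def token_for(row_key):
    # Stand-in partitioner: hash the row key onto the 0..2**127 token space.
    return int(hashlib.md5(row_key.encode()).hexdigest(), 16) % (2**127)

def replicas(row_key, rf):
    # First replica: the first node whose token is >= the key's token
    # (wrapping around the ring); the remaining rf-1 replicas are the
    # next nodes clockwise around the ring.
    t = token_for(row_key)
    start = next((i for i, (tok, _) in enumerate(RING) if tok >= t), 0)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

# With RF=3 on 4 nodes, each row lands on 3 of the 4 nodes, not all of them.
print(replicas("some-row-key", 3))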




--
David Strauss
   | david@fourkitchens.com
   | +1 512 577 5827 [mobile]
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]