Subject: Re: Integrity of batch_insert and also what about sharding?
From: Benjamin Black
To: user@cassandra.apache.org
Date: Wed, 7 Apr 2010 19:04:59 -0700

What benefit does a SAN give you? I've generally been confused by that
approach, so I'm assuming I'm missing something.
On Wed, Apr 7, 2010 at 6:58 PM, Jason Alexander wrote:
> FWIW, I'd love to see some guidance here too -
>
> From our standpoint, we'll be consolidating the various Match.com sites'
> (match.com, chemistry.com, etc.) data into a single data warehouse running
> Cassandra. We're looking at roughly the same amounts of data (30 TB or
> more). We were assuming 3-5 big servers sitting atop a SAN. But, again,
> just a guess, following the existing conventions we use for other systems.
>
>
> ________________________________________
> From: banks [banksenus@gmail.com]
> Sent: Wednesday, April 07, 2010 8:47 PM
> To: user@cassandra.apache.org
> Subject: Re: Integrity of batch_insert and also what about sharding?
>
> What I'm trying to wrap my head around is where the break-even point is...
>
> If I'm going to store 30 terabytes in this thing, what's optimal for
> performance and scalability? Is it best to run 3 powerful nodes, 100
> smaller nodes, or a node on each web blade with 300 GB behind each, ya
> know? I'm sure there's a point where the gossip chatter becomes
> overwhelming, and there are ups and downs to each. I haven't really seen
> a best-practices document that gives the pros and cons of each method of
> scaling.
>
> One 64-processor, 90 GB memory mega-machine running a single-node
> Cassandra, but on a RAID 5 SAN: good? bad? Why?
>
> 30 web blades each running a Cassandra node, each with 1 TB of local
> RAID 5 storage: good? bad? Why?
>
> I get that every implementation is different. What I'm looking for is the
> known, proven optimum for this software, and what's to be avoided because
> it's a given that it doesn't work.
>
> On Wed, Apr 7, 2010 at 6:40 PM, Benjamin Black wrote:
> That depends on your goals for fault tolerance and recovery time. If
> you use RAID 1 (or another redundant configuration) you can tolerate disk
> failure without Cassandra having to do repair. For large data sets,
> that can be a significant win.
>
> b
>
> On Wed, Apr 7, 2010 at 6:02 PM, banks wrote:
>> Then from an IT standpoint, if I'm using an RF of 3, it stands to reason
>> that running on RAID 1 makes sense, since RAID and RF achieve the same
>> ends... it makes sense to stripe for speed and let Cassandra deal with
>> redundancy, eh?
>>
>>
>> On Wed, Apr 7, 2010 at 4:07 PM, Benjamin Black wrote:
>>>
>>> On Wed, Apr 7, 2010 at 3:41 PM, banks wrote:
>>> >
>>> > 2. each cassandra node essentially has the same datastore as all
>>> > nodes, correct?
>>>
>>> No. The ReplicationFactor you set determines how many copies of a
>>> piece of data you want. If your number of nodes is higher than your
>>> RF, as is common, you will not have the same data on all nodes. The
>>> exact set of nodes to which data is replicated is determined by the
>>> row key, placement strategy, and node tokens.
>>>
>>> > So if I've got 3 terabytes of data and 3 Cassandra nodes, am I
>>> > eating 9 TB on the SAN? Are there provisions for essentially
>>> > sharding across nodes, so that each node only handles a given key
>>> > range? If so, where is the howto on that?
>>>
>>> Sharding is a concept from databases that don't have native
>>> replication, and so need a term to describe what they bolt on for that
>>> functionality. Distribution among nodes based on key ranges is how
>>> Cassandra always operates.
>>>
>>>
>>> b
>>
>>
>
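For anyone weighing the options discussed above (a few big servers vs. many blades, with or without RAID 1 under RF 3), the raw-disk arithmetic is easy to sketch. This is a back-of-the-envelope estimate only; the node counts are the hypothetical ones from this thread, and it ignores compaction headroom and other real-world overhead:

```python
# Back-of-the-envelope sizing for a Cassandra cluster (hypothetical numbers
# from this thread). Raw disk per node = data size * ReplicationFactor,
# divided across nodes, then doubled if each node also mirrors with RAID 1.

def raw_disk_per_node(data_tb, replication_factor, nodes, raid1=False):
    """Return the raw TB of disk each node must provide."""
    per_node = data_tb * replication_factor / nodes
    return per_node * (2 if raid1 else 1)

# The 30 TB data set from the thread, RF = 3:
print(raw_disk_per_node(30, 3, nodes=5))               # 5 big servers -> 18.0 TB each
print(raw_disk_per_node(30, 3, nodes=30))              # 30 blades -> 3.0 TB each
print(raw_disk_per_node(30, 3, nodes=30, raid1=True))  # blades + RAID 1 -> 6.0 TB each
```

This is also why the earlier 3 TB / RF 3 / 9 TB figure holds regardless of node count: replication multiplies total storage, and the cluster only divides where it lives.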
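Benjamin's point that placement is determined by the row key, placement strategy, and node tokens can be illustrated with a toy token ring. This is a simplified sketch in the spirit of SimpleStrategy, not Cassandra's actual code: the node names are made up, and MD5 stands in for the real partitioner's hash:

```python
# Toy token ring: each node owns the range ending at its token. A row lands
# on the first node whose token is >= the hash of its key, then is
# replicated to the next RF-1 nodes clockwise around the ring.
import hashlib
from bisect import bisect_right

def token(key):
    """Hash a string onto the ring (MD5 here; illustrative only)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, rf):
        self.rf = rf
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, row_key):
        tokens = [t for t, _ in self.ring]
        i = bisect_right(tokens, token(row_key)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.rf)]

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"], rf=3)
print(ring.replicas("user:1234"))  # 3 distinct nodes -- not all 5
```

With 5 nodes and RF 3, any given row lives on exactly 3 of the 5 nodes, which is why no node holds the full data set: key-range distribution and replication are built in, no bolt-on sharding required.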