Subject: Re: Practical node size limits
From: Edward Capriolo <edlinuxguru@gmail.com>
To: user@cassandra.apache.org
Date: Sun, 29 Jul 2012 23:45:48 -0400

Yikes.

You should read:
http://wiki.apache.org/cassandra/LargeDataSetConsiderations

Essentially what it sounds like you are now running into is this: the BloomFilters for each SSTable must exist in main memory. Repair also tends to create some extra data, which normally gets compacted away later.

Your best bet is to temporarily raise the Xmx heap and adjust the index sampling size (a sketch of the relevant settings is below). That is, if you need to save the data at all; if it is just test data, you may want to give up and start fresh.

Generally the issue with large-disk configurations is that it is hard to keep a good RAM/disk ratio. Most reads then turn into disk seeks and throughput is low. I get the vibe people believe large stripes are going to help Cassandra, but stripes generally only increase sequential throughput, and Cassandra is a random-read system.
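To be concrete, here is a sketch of the two knobs I mean, assuming a stock 1.1 install layout. The values are only there to show the direction of the change, not recommendations for your hardware:

    # conf/cassandra-env.sh -- temporarily raise the JVM heap
    MAX_HEAP_SIZE="12G"
    HEAP_NEWSIZE="800M"

    # conf/cassandra.yaml -- sample fewer primary index entries into the
    # heap. The default is 128; raising it (say, to 512) shrinks the
    # in-memory index samples at the cost of slightly slower key lookups.
    index_interval: 512

Both only take effect at startup, so roll them out a node at a time.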
How much RAM per disk you need is case dependent, but a 1/5 ratio of RAM to disk is where I think most people want to be, unless the system is carrying SSDs. (At that ratio, a node holding 1.2TB of data wants on the order of 240GB of RAM.) Again, you have to keep your bloom filters in Java heap memory, so a design that tries to create a quadrillion small rows is going to have memory issues as well; see the back-of-envelope sketch at the end of this mail.

On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz wrote:
> I'm trying to determine if there are any practical limits on the amount of data that a single node can handle efficiently, and if so, whether I've hit that limit or not.
>
> We've just set up a new 7-node cluster with Cassandra 1.1.2 running under OpenJDK6. Each node is a 12-core Xeon with 24GB of RAM and is connected to a stripe of 10 3TB disk mirrors (a total of 20 spindles each) via dual SATA-3 interconnects. I can read and write around 900MB/s sequentially on the arrays. I started out with Cassandra tuned with all-default values, with the exception of the compaction throughput, which was increased from 16MB/s to 100MB/s. These defaults set the heap size to 6GB.
>
> Our schema is pretty simple; only 4 column families, and each has one secondary index. The replication factor was set to four, and compression disabled. Our access patterns are intended to be about equal numbers of inserts and selects, with no updates and the occasional delete.
>
> The first thing we did was begin to load data into the cluster. We could perform about 3000 inserts per second, which stayed mostly flat. Things started to go wrong around the time the nodes exceeded 800GB. Cassandra began to generate a lot of "mutation messages dropped" warnings and was complaining that the heap was over 75% capacity.
>
> At that point, we stopped all activity on the cluster and attempted a repair. We did this so we could be sure that the data was fully consistent before continuing. Our mistake was probably trying to repair all of the nodes simultaneously - within an hour, Java terminated on one of the nodes with a heap out-of-memory message. I then increased all of the heap sizes to 8GB and reduced the heap_newsize to 800MB. All of the nodes were restarted, and there was no outside activity on the cluster. I then began a repair on a single node. Within a few hours, it OOMed again and exited. I then increased the heap to 12GB and attempted the same thing. This time, the repair ran for about 7 hours before exiting from an OOM condition.
>
> By now, the repair had increased the amount of data on some of the nodes to over 1.2TB. There is no going back to a 6GB heap size - Cassandra now exits with an OOM during startup unless the heap is set higher. It's at 16GB now, and a single node has been repairing for a couple of days. Though I have no personal experience with this, I've been told that Java's garbage collector doesn't perform well with heaps above 8GB. I'm wary of setting it higher, but I can add up to 192GB of RAM to each node if necessary.
>
> How much heap does Cassandra need for this amount of data with only four CFs? Am I scaling this cluster in completely the wrong direction? Is there a magic garbage collection setting that I need to add in cassandra-env that isn't there by default?
>
> Thanks,
>
> - .Dustin
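P.S. On the bloom filter point above, a back-of-envelope sketch in Python using the standard filter sizing formula. The row count and false-positive rate are assumptions for illustration; Cassandra sizes its filters with its own defaults, so treat the output as order-of-magnitude only:

    import math

    def bloom_bits_per_key(p):
        # standard Bloom filter sizing: m/n = -ln(p) / (ln 2)^2
        return -math.log(p) / (math.log(2) ** 2)

    rows = 10 ** 9  # hypothetical: one billion small rows on one node
    p = 0.01        # assumed false-positive rate, for illustration only
    gb = bloom_bits_per_key(p) * rows / 8 / 2 ** 30
    print("%.1f GB of heap just for bloom filters" % gb)

That works out to roughly 1.1GB of heap per billion keys at p=0.01, which is why a design with a quadrillion tiny rows is a non-starter.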