Subject: Re: Practical node size limits
From: Dustin Wenz
Date: Tue, 4 Sep 2012 14:42:37 -0500
To: user@cassandra.apache.org

I'm following up on this issue, which I've been monitoring for the last several weeks. I thought people might find my observations interesting.

Ever since increasing the heap size to 64GB, we've had no OOM conditions that resulted in a JVM termination. Our nodes hold around 2.5TB of data each, and the replication factor is four. IO on the cluster seems to be fine, though I haven't been paying particular attention to any GC hangs.

The bottleneck now seems to be repair time. If any node becomes too inconsistent, or needs to be replaced, the rebuild time is over a week. That issue alone makes this cluster configuration unsuitable for production use.

- .Dustin

On Jul 30, 2012, at 2:04 PM, Dustin Wenz wrote:

> Thanks for the pointer! It sounds likely that's what I'm seeing. CFStats reports that the bloom filter size is currently several gigabytes. Is there any way to estimate how much heap space a repair would require? Is it simply a function of adding up the filter file sizes, plus some fraction of those on neighboring nodes?
>
> I'm still curious about the largest heap sizes that people are running with in their deployments. I'm considering increasing ours to 64GB (with 96GB of physical memory) to see where that gets us. Would it be necessary to keep the young-gen size small to avoid long GC pauses? I also suspect that I may need to keep my memtable sizes small to avoid long flushes; maybe in the 1-2GB range.
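For reference, the heap and young-gen settings under discussion live in conf/cassandra-env.sh. A minimal sketch with the values mentioned above (illustrative numbers from this thread, not recommendations; setting these overrides the script's automatic sizing):

```shell
# conf/cassandra-env.sh -- example values from the discussion above.
# If MAX_HEAP_SIZE is set, HEAP_NEWSIZE must be set as well.
MAX_HEAP_SIZE="64G"   # total JVM heap (becomes -Xms/-Xmx)
HEAP_NEWSIZE="800M"   # young generation (-Xmn); kept small to shorten GC pauses
```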
>
> - .Dustin
>
> On Jul 29, 2012, at 10:45 PM, Edward Capriolo wrote:
>
>> Yikes. You should read:
>>
>> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>>
>> Essentially, what it sounds like you are now running into is this:
>> the BloomFilters for each SSTable must exist in main memory. Repair
>> tends to create some extra data, which normally gets compacted away
>> later.
>>
>> Your best bet is to temporarily raise the Xmx heap and adjust the
>> index sampling size, if you need to save the data. (If it is just
>> test data, you may want to give up and start fresh.)
>>
>> Generally, the issue with large-disk configurations is that it is
>> hard to keep a good RAM/disk ratio. Most reads then turn into disk
>> seeks, and throughput is low. I get the vibe people believe large
>> stripes are going to help Cassandra. The issue is that stripes
>> generally only increase sequential throughput, but Cassandra is a
>> random-read system.
>>
>> How much RAM/disk you need is case-dependent, but a 1/5 ratio of RAM
>> to disk is where I think most people want to be, unless their system
>> is carrying SSD disks.
>>
>> Again, you have to keep your bloom filters in Java heap memory, so
>> any design that tries to create a quadrillion small rows is going to
>> have memory issues as well.
>>
>> On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz wrote:
>>> I'm trying to determine if there are any practical limits on the amount of data that a single node can handle efficiently, and if so, whether I've hit that limit or not.
>>>
>>> We've just set up a new 7-node cluster with Cassandra 1.1.2 running under OpenJDK 6. Each node is a 12-core Xeon with 24GB of RAM, connected to a stripe of 10 3TB disk mirrors (a total of 20 spindles each) via dual SATA-3 interconnects. I can read and write around 900MB/s sequentially on the arrays.
>>> I started out with Cassandra tuned with all-default values, with the exception of the compaction throughput, which was increased from 16MB/s to 100MB/s. These defaults set the heap size to 6GB.
>>>
>>> Our schema is pretty simple; only 4 column families, each with one secondary index. The replication factor was set to four, and compression disabled. Our access patterns are intended to be about equal numbers of inserts and selects, with no updates and the occasional delete.
>>>
>>> The first thing we did was begin to load data into the cluster. We could perform about 3000 inserts per second, which stayed mostly flat. Things started to go wrong around the time the nodes exceeded 800GB. Cassandra began to generate a lot of "mutation messages dropped" warnings, and was complaining that the heap was over 75% capacity.
>>>
>>> At that point, we stopped all activity on the cluster and attempted a repair. We did this so we could be sure that the data was fully consistent before continuing. Our mistake was probably trying to repair all of the nodes simultaneously - within an hour, Java terminated on one of the nodes with a heap out-of-memory message. I then increased all of the heap sizes to 8GB, and reduced heap_newsize to 800MB. All of the nodes were restarted, and there was no outside activity on the cluster. I then began a repair on a single node. Within a few hours, it OOMed again and exited. I then increased the heap to 12GB and attempted the same thing. This time, the repair ran for about 7 hours before exiting with an OOM condition.
>>>
>>> By now, the repair had increased the amount of data on some of the nodes to over 1.2TB. There is no going back to a 6GB heap size - Cassandra now exits with an OOM during startup unless the heap is set higher. It's at 16GB now, and a single node has been repairing for a couple of days.
>>> Though I have no personal experience with this, I've been told that Java's garbage collector doesn't perform well with heaps above 8GB. I'm wary of setting it higher, but I can add up to 192GB of RAM to each node if necessary.
>>>
>>> How much heap does Cassandra need for this amount of data with only four CFs? Am I scaling this cluster in completely the wrong direction? Is there a magic garbage-collection setting that I need to add in cassandra-env that isn't there by default?
>>>
>>> Thanks,
>>>
>>> - .Dustin
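The question above about estimating Bloom-filter heap usage can be roughed out with the textbook optimal-filter-size formula. This is only a back-of-the-envelope sketch, not Cassandra's exact implementation (which sizes filters per SSTable and rounds bucket counts), and the row count and false-positive chance below are made-up illustrative numbers:

```python
import math

def bloom_filter_bytes(n_keys: float, fp_chance: float) -> float:
    """Optimal Bloom filter size in bytes for n_keys at a target
    false-positive probability p: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

# Hypothetical example: 10 billion row keys at a 1% false-positive chance
# works out to on the order of 11 GiB of on-heap filters.
gib = bloom_filter_bytes(10e9, 0.01) / 2**30
print(f"{gib:.1f} GiB")
```

At roughly 1.2 bytes per key at 1% false positives, filter size scales linearly with row count, which is why Edward's warning about "a quadrillion small rows" translates directly into heap pressure.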