Subject: Re: Practical node size limits
From: Dustin Wenz
Date: Tue, 4 Sep 2012 14:42:37 -0500
To: user@cassandra.apache.org

I'm following up on this issue, which I've been monitoring for the last several weeks. I thought people might find my observations interesting.

Ever since increasing the heap size to 64GB, we've had no OOM conditions that resulted in a JVM termination. Our nodes hold around 2.5TB of data each, and the replication factor is four. IO on the cluster seems to be fine, though I haven't been paying particular attention to any GC hangs.

The bottleneck now seems to be repair time. If any node becomes too inconsistent, or needs to be replaced, the rebuild time is over a week. That issue alone makes this cluster configuration unsuitable for production use.

- .Dustin

On Jul 30, 2012, at 2:04 PM, Dustin Wenz wrote:

> Thanks for the pointer! It sounds likely that's what I'm seeing. CFStats reports that the bloom filter size is currently several gigabytes. Is there any way to estimate how much heap space a repair would require? Is it simply a function of adding up the filter file sizes, plus some fraction of those on neighboring nodes?
>
> I'm still curious about the largest heap sizes that people are running with in their deployments. I'm considering increasing ours to 64GB (with 96GB of physical memory) to see where that gets us. Would it be necessary to keep the young-gen size small to avoid long GC pauses? I also suspect that I may need to keep my memtable sizes small to avoid long flushes; maybe in the 1-2GB range.
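For reference, the heap and young-gen settings under discussion live in conf/cassandra-env.sh. A minimal sketch with the values mentioned above (illustrative numbers from this thread, not recommendations; setting these overrides the script's automatic sizing):

```shell
# conf/cassandra-env.sh -- example values from the discussion above.
# If MAX_HEAP_SIZE is set, HEAP_NEWSIZE must be set as well.
MAX_HEAP_SIZE="64G"   # total JVM heap (becomes -Xms/-Xmx)
HEAP_NEWSIZE="800M"   # young generation (-Xmn); kept small to shorten GC pauses
```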
>
> - .Dustin
>
> On Jul 29, 2012, at 10:45 PM, Edward Capriolo wrote:
>
>> Yikes. You should read:
>>
>> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>>
>> Essentially, what it sounds like you are now running into is this:
>> the BloomFilters for each SSTable must exist in main memory. Repair
>> tends to create some extra data, which normally gets compacted away
>> later.
>>
>> Your best bet is to temporarily raise the Xmx heap and adjust the
>> index sampling size, if you need to save the data. (If it is just
>> test data, you may want to give up and start fresh.)
>>
>> Generally, the issue with large-disk configurations is that it is
>> hard to keep a good RAM/disk ratio. Most reads then turn into disk
>> seeks, and throughput is low. I get the vibe people believe large
>> stripes are going to help Cassandra. The issue is that stripes
>> generally only increase sequential throughput, but Cassandra is a
>> random-read system.
>>
>> How much RAM/disk you need is case-dependent, but a 1/5 ratio of RAM
>> to disk is where I think most people want to be, unless their system
>> is carrying SSD disks.
>>
>> Again, you have to keep your bloom filters in Java heap memory, so
>> any design that tries to create a quadrillion small rows is going to
>> have memory issues as well.
>>
>> On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz wrote:
>>> I'm trying to determine if there are any practical limits on the amount of data that a single node can handle efficiently, and if so, whether I've hit that limit or not.
>>>
>>> We've just set up a new 7-node cluster with Cassandra 1.1.2 running under OpenJDK 6. Each node is a 12-core Xeon with 24GB of RAM, connected to a stripe of 10 3TB disk mirrors (a total of 20 spindles each) via dual SATA-3 interconnects. I can read and write around 900MB/s sequentially on the arrays.
>>> I started out with Cassandra tuned with all-default values, with the exception of the compaction throughput, which was increased from 16MB/s to 100MB/s. These defaults set the heap size to 6GB.
>>>
>>> Our schema is pretty simple; only 4 column families, each with one secondary index. The replication factor was set to four, and compression disabled. Our access patterns are intended to be about equal numbers of inserts and selects, with no updates and the occasional delete.
>>>
>>> The first thing we did was begin to load data into the cluster. We could perform about 3000 inserts per second, which stayed mostly flat. Things started to go wrong around the time the nodes exceeded 800GB. Cassandra began to generate a lot of "mutation messages dropped" warnings, and was complaining that the heap was over 75% capacity.
>>>
>>> At that point, we stopped all activity on the cluster and attempted a repair. We did this so we could be sure that the data was fully consistent before continuing. Our mistake was probably trying to repair all of the nodes simultaneously - within an hour, Java terminated on one of the nodes with a heap out-of-memory message. I then increased all of the heap sizes to 8GB, and reduced heap_newsize to 800MB. All of the nodes were restarted, and there was no outside activity on the cluster. I then began a repair on a single node. Within a few hours, it OOMed again and exited. I then increased the heap to 12GB and attempted the same thing. This time, the repair ran for about 7 hours before exiting with an OOM condition.
>>>
>>> By now, the repair had increased the amount of data on some of the nodes to over 1.2TB. There is no going back to a 6GB heap size - Cassandra now exits with an OOM during startup unless the heap is set higher. It's at 16GB now, and a single node has been repairing for a couple of days.
>>> Though I have no personal experience with this, I've been told that Java's garbage collector doesn't perform well with heaps above 8GB. I'm wary of setting it higher, but I can add up to 192GB of RAM to each node if necessary.
>>>
>>> How much heap does Cassandra need for this amount of data with only four CFs? Am I scaling this cluster in completely the wrong direction? Is there a magic garbage-collection setting that I need to add in cassandra-env that isn't there by default?
>>>
>>> Thanks,
>>>
>>> - .Dustin
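The question above about estimating Bloom-filter heap usage can be roughed out with the textbook optimal-filter-size formula. This is only a back-of-the-envelope sketch, not Cassandra's exact implementation (which sizes filters per SSTable and rounds bucket counts), and the row count and false-positive chance below are made-up illustrative numbers:

```python
import math

def bloom_filter_bytes(n_keys: float, fp_chance: float) -> float:
    """Optimal Bloom filter size in bytes for n_keys at a target
    false-positive probability p: m = -n * ln(p) / (ln 2)^2 bits."""
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

# Hypothetical example: 10 billion row keys at a 1% false-positive chance
# works out to on the order of 11 GiB of on-heap filters.
gib = bloom_filter_bytes(10e9, 0.01) / 2**30
print(f"{gib:.1f} GiB")
```

At roughly 1.2 bytes per key at 1% false positives, filter size scales linearly with row count, which is why Edward's warning about "a quadrillion small rows" translates directly into heap pressure.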