cassandra-user mailing list archives

From Віталій Тимчишин <>
Subject Re: Practical node size limits
Date Wed, 05 Sep 2012 08:10:54 GMT
You can try increasing the streaming throttle.
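
As a rough back-of-the-envelope sketch of why the throttle is worth checking: rebuilding 2.5TB in a week works out to only about 4MB/s sustained, far below what the disks described in this thread can do. The 400 megabit/s figure below is an assumed throttle value used purely for illustration (I believe the relevant cassandra.yaml setting is stream_throughput_outbound_megabits_per_sec, but check your version), and the function names are hypothetical. It also ignores validation and compaction work, so it only bounds the streaming portion.

```python
# Back-of-the-envelope rebuild arithmetic; all names here are illustrative.
TB = 10**12               # decimal terabyte, in bytes
SECONDS_PER_DAY = 86_400


def effective_rate_mb_per_s(data_bytes, days):
    """Average throughput (MB/s) implied by moving data_bytes in `days` days."""
    return data_bytes / (days * SECONDS_PER_DAY) / 1e6


def transfer_days(data_bytes, throttle_megabits_per_s):
    """Days needed to stream data_bytes at a given throttle (megabits/s)."""
    bytes_per_s = throttle_megabits_per_s * 1e6 / 8
    return data_bytes / bytes_per_s / SECONDS_PER_DAY


node_data = 2.5 * TB  # per-node data size reported in the thread

# Rate implied by "over a week" for 2.5TB.
print(f"{effective_rate_mb_per_s(node_data, 7):.2f} MB/s")

# Time for the streaming portion alone at an ASSUMED 400 megabit/s throttle.
print(f"{transfer_days(node_data, 400):.2f} days")
```

If the observed rate is well below the configured throttle, the throttle is probably not the limiting factor and raising it alone will not help much.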

2012/9/4 Dustin Wenz <>

> I'm following up on this issue, which I've been monitoring for the last
> several weeks. I thought people might find my observations interesting.
> Ever since increasing the heap size to 64GB, we've had no OOM conditions
> that resulted in a JVM termination. Our nodes have around 2.5TB of data
> each, and the replication factor is four. IO on the cluster seems to be
> fine, though I haven't been paying particular attention to any GC hangs.
> The bottleneck now seems to be the repair time. If any node becomes too
> inconsistent, or needs to be replaced, the rebuild time is over a week.
> That issue alone makes this cluster configuration unsuitable for production
> use.
>         - .Dustin
> On Jul 30, 2012, at 2:04 PM, Dustin Wenz <> wrote:
> > Thanks for the pointer! It sounds likely that's what I'm seeing. CFStats
> reports that the bloom filter size is currently several gigabytes. Is there
> any way to estimate how much heap space a repair would require? Is it a
> function of simply adding up the filter file sizes, plus some fraction of
> neighboring nodes?
> >
> > I'm still curious about the largest heap sizes that people are running
> with on their deployments. I'm considering increasing ours to 64GB (with
> 96GB physical memory) to see where that gets us. Would it be necessary to
> keep the young-gen size small to avoid long GC pauses? I also suspect that
> I may need to keep my memtable sizes small to avoid long flushes; maybe in
> the 1-2GB range.
> >
> >       - .Dustin
> >
> > On Jul 29, 2012, at 10:45 PM, Edward Capriolo <>
> wrote:
> >
> >> Yikes. You should read:
> >>
> >>
> >>
> >> Essentially what it sounds like you are now running into is this:
> >>
> >> The BloomFilters for each SSTable must exist in main memory. Repair
> >> tends to create some extra data which normally gets compacted away
> >> later.
> >>
> >> If you need to save the data, your best bet is to temporarily raise
> >> the Xmx heap and adjust the index sampling size. (If it is just test
> >> data, you may want to give up and start fresh.)
> >>
> >> Generally the issue with large disk configurations is that it is hard to
> >> keep a good ram/disk ratio. Then most reads turn into disk seeks and
> >> the throughput is low. I get the vibe people believe large stripes are
> >> going to help Cassandra. The issue is that stripes generally only
> >> increase sequential throughput, but Cassandra is a random read system.
> >>
> >> How much ram/disk you need is case dependent but 1/5 ratio of RAM to
> >> disk is where I think most people want to be, unless their system is
> >> carrying SSD disks.
> >>
> >> Again, you have to keep your bloom filters in Java heap memory, so any
> >> design that tries to create a quadrillion small rows is going to have
> >> memory issues as well.
> >>
> >> On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz <>
> wrote:
> >>> I'm trying to determine if there are any practical limits on the
> amount of data that a single node can handle efficiently, and if so,
> whether I've hit that limit or not.
> >>>
> >>> We've just set up a new 7-node cluster with Cassandra 1.1.2 running
> under OpenJDK6. Each node is 12-core Xeon with 24GB of RAM and is connected
> to a stripe of 10 3TB disk mirrors (a total of 20 spindles each) and
> connected via dual SATA-3 interconnects. I can read and write around
> 900MB/s sequentially on the arrays. I started out with Cassandra tuned with
> all-default values, with the exception of the compaction throughput which
> was increased from 16MB/s to 100MB/s. These defaults will set the heap size
> to 6GB.
> >>>
> >>> Our schema is pretty simple; only 4 column families and each has one
> secondary index. The replication factor was set to four, and compression
> disabled. Our access patterns are intended to be about equal numbers of
> inserts and selects, with no updates, and the occasional delete.
> >>>
> >>> The first thing we did was begin to load data into the cluster. We
> could perform about 3000 inserts per second, which stayed mostly flat.
> Things started to go wrong around the time the nodes exceeded 800GB.
> Cassandra began to generate a lot of "mutations messages dropped" warnings,
> and was complaining that the heap was over 75% capacity.
> >>>
> >>> At that point, we stopped all activity on the cluster and attempted a
> repair. We did this so we could be sure that the data was fully consistent
> before continuing. Our mistake was probably trying to repair all of the
> nodes simultaneously - within an hour, Java terminated on one of the nodes
> with a heap out-of-memory message. I then increased all of the heap sizes
> to 8GB, and reduced the heap_newsize to 800MB. All of the nodes were
> restarted, and there was no outside activity on the cluster. I then
> began a repair on a single node. Within a few hours, it OOMed again and
> exited. I then increased the heap to 12GB, and attempted the same thing.
> This time, the repair ran for about 7 hours before exiting from an OOM
> condition.
> >>>
> >>> By now, the repair had increased the amount of data on some of the
> nodes to over 1.2TB. There is no going back to a 6GB heap size - Cassandra
> now exits with an OOM during startup unless the heap is set higher. It's at
> 16GB now, and a single node has been repairing for a couple of days. Though
> I have no personal experience with this, I've been told that Java's garbage
> collector doesn't perform well with heaps above 8GB. I'm wary of setting it
> higher, but I can add up to 192GB of RAM to each node if necessary.
> >>>
> >>> How much heap does cassandra need for this amount of data with only
> four CFs? Am I scaling this cluster in completely the wrong direction? Is
> there a magic garbage collection setting that I need to add in
> cassandra-env that isn't there by default?
> >>>
> >>> Thanks,
> >>>
> >>> - .Dustin
> >

Best regards,
 Vitalii Tymchyshyn
