From: Peter Schuller
To: user@cassandra.apache.org
Subject: Re: Read Latency Degradation
Date: Sat, 18 Dec 2010 18:58:04 +0100

> Smaller nodes just seem to fit the Cassandra architecture a lot better. We
> can not use cloud instances, so the cost for us to go to <500gb nodes is
> prohibitive. Cassandra lumps all processes on the node together into one
> bucket, and that almost then requires a smaller node data set. There are no
> regions, tablets, or partitions created to throttle compaction and prevent
> huge data files.

There are definitely some things to improve. I think what you have mentioned
is covered, but if you feel you're hitting something that is not covered by
the wiki page I mentioned in my previous post
(http://wiki.apache.org/cassandra/LargeDataSetConsiderations), please do
augment it or say so.

In your original post you said you went from 5 ms to 50 ms. Is that the
average latency under load, or the latency of a single request absent other
traffic and absent background compaction etc.? If a single read is taking
50 ms for reasons that have nothing to do with other concurrent activity,
that smells of something being wrong to me.

Otherwise, is your primary concern the worse latency/throughput during
compactions/repairs, or the overall throughput/latency during normal
operation?

> I have considered dropping the heap down to 8gb, but having pained through
> many cmf in the past I thought the larger heap should help prevent the stop
> the world gc.

I'm not sure what got merged to 0.6.8, but you may want to grab the JVM
options from the 0.7 branch - in particular, the initial occupancy
triggering of CMS mark-sweep phases. Concurrent mode failures could just be
because the CMS heuristics failed, rather than because the heap is
legitimately too small.
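For reference, the options I have in mind look roughly like the following.
This is a sketch, not a verbatim copy of the 0.7 branch; the occupancy
fraction and the GC log path are just example values, and on 0.6 the usual
place for them would be the JVM_OPTS in bin/cassandra.in.sh:

    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    # Start the CMS mark-sweep phase at a fixed occupancy instead of relying
    # on the JVM's own heuristics, which is what tends to cause concurrent
    # mode failures when it guesses wrong.
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
    # GC logging makes it easy to see the heap usage immediately after each
    # CMS collection, i.e. the actual live set.
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"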
If the heuristics are failing, you may be able to lower the heap size if you
change the CMS trigger. I recommend monitoring heap usage for that; look at
the heap usage as it appears right after a CMS collection has completed to
judge the "real" live set size.

> Row cache is not an option for us. We expect going to disk, and key cache is
> the only cache that can help speed things up a little. We have wide rows so
> key cache is an un-expensive boost.

Ok, makes sense.

> This is why we schedule weekly major compaction. We update ALL rows every
> day, often over-writing previous values.

Ok - so you're definitely in a position to suffer more than most use cases
from data being spread over multiple sstables.

>> (5) In general the way I/O works, latency will skyrocket once you
>> start saturating your disks. As long as you're significantly below
>> full utilization of your disks, you'll see pretty stable and low
>> latencies. As you approach full saturation, the latencies will tend to
>> increase super-linearly. Once you're *above* saturation, your
>> latencies skyrocket and reads are dropped because the rate cannot be
>> sustained. This means that while latency is a great indicator to look
>> at to judge what the current user perceived behavior is, it is *not* a
>> good thing to look at to extrapolate resource demands or figure out
>> how far you are from saturation / need for more hardware.
>>
> This we can see with munin. We throttle the read load to avoid that "wall".

Do you have a sense of how many reads on disk you're taking per read request
to the node? Do you have a sense of the size of the active set? A big
question is going to be whether caching is effective at all, and how much
additional caching would help. In any case, it would be interesting to know
whether you are seeing more disk seeks per read than you "should".

-- 
/ Peter Schuller
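One rough way to get at the reads-on-disk-per-request number, sketched with
example host and flags (adjust to whatever your nodetool version expects and
to your data volume):

    # Disk-level read operations per second; watch r/s for the data volume.
    iostat -x 10

    # Per-column-family "Read Count"; sample it twice over the same interval
    # and diff to get the node's client read rate.
    nodetool -host localhost cfstats

Dividing the disk reads per second by the node's reads per second gives a
rough estimate of how many disk reads each request costs once the caches
have done their work.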