From: Peter Schuller
To: user@cassandra.apache.org
Subject: Re: Read Latency Degradation
Date: Sat, 18 Dec 2010 18:58:04 +0100

> Smaller nodes just seem to fit the Cassandra architecture a lot better. We
> can not use cloud instances, so the cost for us to go to <500gb nodes is
> prohibitive. Cassandra lumps all processes on the node together into one
> bucket, and that almost then requires a smaller node data set. There are no
> regions, tablets, or partitions created to throttle compaction and prevent
> huge data files.

There are definitely some things to improve. I think what you have mentioned
is covered, but if you feel you're hitting something that is not covered by
the wiki page I mentioned in my previous post
(http://wiki.apache.org/cassandra/LargeDataSetConsiderations), please do
augment it or say so.

In your original post you said you went from 5 ms to 50 ms. Is that the
average latency under load, or the latency of a single request absent other
traffic and absent background compaction etc.? If a single read is taking
50 ms for reasons that have nothing to do with other concurrent activity,
that smells of something being wrong to me.

Otherwise, is your primary concern the worse latency/throughput during
compactions/repairs, or the overall throughput/latency during normal
operation?

> I have considered dropping the heap down to 8gb, but having pained through
> many cmf in the past I thought the larger heap should help prevent the stop
> the world gc.

I'm not sure what got merged to 0.6.8, but you may want to grab the JVM
options from the 0.7 branch - in particular, the initial occupancy
triggering of CMS mark-sweep phases. Concurrent mode failures could just be
because the CMS heuristics failed, rather than because the heap is
legitimately too small.
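For reference, the options I have in mind look roughly like the following.
This is a sketch, not a verbatim copy of the 0.7 branch; the occupancy
fraction and the GC log path are just example values, and on 0.6 the usual
place for them would be the JVM_OPTS in bin/cassandra.in.sh:

    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    # Start the CMS mark-sweep phase at a fixed occupancy instead of relying
    # on the JVM's own heuristics, which is what tends to cause concurrent
    # mode failures when it guesses wrong.
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
    # GC logging makes it easy to see the heap usage immediately after each
    # CMS collection, i.e. the actual live set.
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"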
If the heuristics are failing, you may be able to lower the heap size if you
change the CMS trigger. I recommend monitoring heap usage for that; look at
the heap usage as it appears right after a CMS collection has completed to
judge the "real" live set size.

> Row cache is not an option for us. We expect going to disk, and key cache is
> the only cache that can help speed things up a little. We have wide rows so
> key cache is an un-expensive boost.

Ok, makes sense.

> This is why we schedule weekly major compaction. We update ALL rows every
> day, often over-writing previous values.

Ok - so you're definitely in a position to suffer more than most use cases
from data being spread over multiple sstables.

>> (5) In general the way I/O works, latency will skyrocket once you
>> start saturating your disks. As long as you're significantly below
>> full utilization of your disks, you'll see pretty stable and low
>> latencies. As you approach full saturation, the latencies will tend to
>> increase super-linearly. Once you're *above* saturation, your
>> latencies skyrocket and reads are dropped because the rate cannot be
>> sustained. This means that while latency is a great indicator to look
>> at to judge what the current user perceived behavior is, it is *not* a
>> good thing to look at to extrapolate resource demands or figure out
>> how far you are from saturation / need for more hardware.
>>
> This we can see with munin. We throttle the read load to avoid that "wall".

Do you have a sense of how many reads on disk you're taking per read request
to the node? Do you have a sense of the size of the active set? A big
question is going to be whether caching is effective at all, and how much
additional caching would help. In any case, it would be interesting to know
whether you are seeing more disk seeks per read than you "should".

-- 
/ Peter Schuller
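One rough way to get at the reads-on-disk-per-request number, sketched with
example host and flags (adjust to whatever your nodetool version expects and
to your data volume):

    # Disk-level read operations per second; watch r/s for the data volume.
    iostat -x 10

    # Per-column-family "Read Count"; sample it twice over the same interval
    # and diff to get the node's client read rate.
    nodetool -host localhost cfstats

Dividing the disk reads per second by the node's reads per second gives a
rough estimate of how many disk reads each request costs once the caches
have done their work.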