Subject: Re: Memtable tuning in 1.0 and higher
From: Joost van de Wijgerd <jwijgerd@gmail.com>
To: dev@cassandra.apache.org
Cc: user
Date: Thu, 28 Jun 2012 20:39:14 +0200

Hi Jonathan,

The problem is not that I haven't allocated enough memory to the memtables;
the problem is that the memtable for this particular CF flushes too early
because the liveRatio stays at 1.0.

Looking at the code, the liveRatio is calculated from the throughput of the
memtable and the size in memory of the memtable, in this line of code:

    double newRatio = (double) deepSize / currentThroughput.get();

currentThroughput is increased even before the data is merged into the
memtable, so it is actually measuring the write throughput afaik, whereas
deepSize is the size of the current memtable in memory. On an overwrite-heavy
workload you will get values below 1.0 (which are correct). This leads to the
size in memory being wrongly calculated (overestimated) and my CF being
flushed way too early.
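To make the effect concrete, here is a tiny standalone example. It is not the
actual Memtable code; the byte counts are made up to roughly match the 0.025
ratio I see on this CF, and the "flush decision" comment is only my reading of
how the global limit is applied:

    public class LiveRatioExample
    {
        public static void main(String[] args)
        {
            // Made-up numbers that roughly match this CF: ~4 GB counted as
            // written (every overwrite counts again), but only ~100 MB of it
            // is actually retained in the memtable.
            long currentThroughput = 4000L * 1000 * 1000;
            long deepSize          =  100L * 1000 * 1000;

            double newRatio     = (double) deepSize / currentThroughput; // ~0.025
            double flooredRatio = Math.max(1.0, newRatio);               // clamped back to 1.0

            // The flush decision is (roughly) driven by throughput * liveRatio,
            // so the estimated live size ends up ~40x larger than reality here.
            System.out.printf("raw ratio=%.3f  floored ratio=%.1f%n", newRatio, flooredRatio);
            System.out.printf("estimated live size=%.0f MB  actual=%.0f MB%n",
                              currentThroughput * flooredRatio / 1e6, deepSize / 1e6);
        }
    }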
Because I can only optimize memtable_total_space_in_mb on a global scale, I
cannot do this too aggressively: I have other CFs that don't have this heavy
overwrite workload.

I don't agree with you that flushing every 5 to 6 minutes is a good thing for
this particular CF. Since it has so many overwrites, I now have to do 10x
more compactions, while before this work was all done in memory. Also,
because I read single columns for my particular use case (fail over and scale
out of session data), my hot sessions will always be in the memtable, so they
load really fast because Cassandra doesn't have to look at the sstables for a
single column read.

To me the current way of tuning memtables feels like a real step back: I had
my system tuned quite nicely in 0.7.x, but now I have lost almost all of the
fine-grained config I had at my disposal. I agree that for the basic use case
this is probably a good, conservative default, but it would be really nice if
the possibility to fine-tune and optimize for a particular workload still
existed as well.

Either I am interpreting this code and the behavior I see in my system
completely wrong, or I hope you guys can see my issue and maybe provide
something for it in the future. In the meantime I am going to patch my
Memtable to be able to handle liveRatios lower than 1.0 and run some tests on
it.
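Roughly the behaviour I have in mind, as a standalone sketch rather than an
actual patch against Memtable.java (the upper cap of 64 and the fallback
value below are my own picks for the sketch):

    // Standalone sketch of the live-ratio behaviour I want to test: take the
    // measured ratio as-is and only guard against nonsense values, instead of
    // flooring it at 1.0. Not a diff against the real Memtable class.
    public final class UncappedLiveRatio
    {
        private UncappedLiveRatio() {}

        public static double computeLiveRatio(long deepSize, long currentThroughput)
        {
            if (currentThroughput <= 0)
                return 1.0; // nothing measured yet, fall back to a neutral value

            double newRatio = (double) deepSize / currentThroughput;

            // keep an upper sanity cap, but let overwrite-heavy CFs go below 1.0
            return Math.min(newRatio, 64.0);
        }
    }

With the numbers from my earlier example that gives the ~0.025 I actually
observe instead of 1.0, which should let this CF keep absorbing overwrites in
memory much longer before it hits the global limit.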
Kind regards,

Joost

On Thu, Jun 28, 2012 at 6:44 PM, Jonathan Ellis wrote:
> [moving to user list]
>
> 1.0 doesn't care about throughput or op count anymore, only whether
> the total memory used by the *current* data in the memtables has
> reached the global limit. So, it automatically doesn't count
> "historical" data that's been overwritten in the current memtable.
>
> So, you may want to increase the memory allocated to memtables... or
> you may be seeing flushes forced by the commitlog size cap, which you
> can also adjust.
>
> But, the bottom line is I'd consider flushing every 5-6 minutes to be
> quite healthy; since the "time flushing" : "time not flushing" ratio
> is quite small, reducing it further is going to give you negligible
> benefit (in exchange for longer replay times.)
>
> On Thu, Jun 28, 2012 at 5:09 AM, Joost van de Wijgerd wrote:
> > Hi,
> >
> > I work for eBuddy. We've been using Cassandra in production since 0.6
> > (using 0.7 and 1.0, skipped 0.8) and use it for several use cases. One
> > of them is to persist our sessions.
> >
> > Some background: in our case sessions are long lived; we have a mobile
> > messaging platform where sessions are essentially eternal. We use
> > Cassandra as a system of record for our sessions, so in case of scale
> > out or fail over we can quickly load the session state again. We use
> > protocol buffers to serialize our data into a byte buffer and then
> > store this as a column value in a (wide) row. We use a partition-based
> > approach to scale, and each partition has its own row in Cassandra.
> > Each session is mapped to a partition and stored in a column in this
> > row.
> >
> > Every time there is a change in the session (i.e. message added, acked
> > etc.) we schedule the session to be flushed to Cassandra. Every x
> > seconds we flush the dirty sessions. So there are a serious number of
> > (over)writes going on and not that many reads (unless there is a
> > failover situation or we scale out). This is using one of the
> > strengths of Cassandra.
> >
> > In versions 0.6 and 0.7 it was possible to control the memtable
> > settings on a CF basis. So for this particular CF we would set the
> > throughput really high, since there are a huge number of overwrites.
> > In the same cluster we have other CFs that have a different load
> > pattern.
> >
> > Since we moved to version 1.0, however, it has become almost
> > impossible to tune our system for this (mixed) workload, since we now
> > have only two knobs to turn (the size of the commit log and the total
> > memtable size) and you have introduced the liveRatio calculation.
> > While this works ok for most workloads, our persistent session store
> > is really hurt by the fact that the liveRatio cannot be lower than 1.0.
> >
> > We generally have an actual liveRatio of 0.025 on this CF due to the
> > huge number of overwrites. We are now artificially tuning up the total
> > memtable size, but this interferes with our other CFs, which have a
> > different workload. Due to this, our performance has degraded quite a
> > bit: on our 0.7 version we had our session CF tuned so that it would
> > flush only once an hour, thus absorbing way more overwrites, thus
> > having to do fewer compactions, and in a failover scenario most
> > requests could be served straight from the memtable (since we are
> > doing single column reads there). Currently we flush every 5 to 6
> > minutes under moderate load, so 10 times worse. This is with the same
> > heap settings etc.
> >
> > Would you guys consider allowing lower values than 1.0 for the
> > liveRatio calculation? This would help us a lot. Perhaps make it a
> > flag so it can be turned on and off? Ideally I would like the
> > possibility back to tune on a CF by CF basis; this could be a special
> > setting that needs to be enabled for power users, the default being
> > what's there now.
> >
> > Also, in the current version the live ratio can never adjust
> > downwards. I see you guys have already made a fix for this in 1.1 but
> > I have not seen it on the 1.0 branch.
> >
> > Let me know what you think.
> >
> > Kind regards,
> >
> > Joost
> >
> > --
> > Joost van de Wijgerd
> > joost.van.de.wijgerd@Skype
> > http://www.linkedin.com/in/jwijgerd
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com

--
Joost van de Wijgerd
Visseringstraat 21B
1051KH Amsterdam
+31624111401
joost.van.de.wijgerd@Skype
http://www.linkedin.com/in/jwijgerd