From: David Boxenhorn
To: user@cassandra.apache.org
Date: Mon, 2 May 2011 13:05:19 +0300
Subject: Re: Combining all CFs into one big one

Wouldn't it be the case that the once-used rows in your batch process
would quickly be traded out of the cache, and replaced by frequently-used
rows? This would be the case even if your batch process goes on for a long
time, since caching is done on a row-by-row basis. In effect, it would mean
that part of your cache is taken up by the batch process, much as if you
dedicated a permanent cache to the batch - except that it isn't permanent,
so it's better!

On Mon, May 2, 2011 at 7:50 AM, Tyler Hobbs wrote:

>> If you had one big cache, wouldn't it be the case that it's mostly
>> populated with frequently accessed rows, and less populated with rarely
>> accessed rows?
>
> Yes.
>
>> In fact, wouldn't one big cache dynamically and automatically give you
>> exactly what you want? If you try to partition the same amount of memory
>> manually, by guesswork, among many tables, aren't you always going to do a
>> worse job?
>
> Suppose you have one CF that's used constantly through interaction by
> users. Suppose you have another CF that's only used periodically by a batch
> process, you tend to access most or all of the rows during the batch
> process, and it's too large to cache all of the rows. Normally, you would
> dedicate cache space to the first CF as anything with human interaction
> tends to have good temporal locality and you want to keep latencies there
> low. On the other hand, caching the second CF provides little to no real
> benefit. When you combine these two CFs, every time your batch process
> runs, rows from the second CF will populate the cache and will cause
> eviction of rows from the first CF, even though having those rows in the
> cache provides little benefit to you.
>
> As another example, if you mix a CF with wide rows and a CF with small
> rows, you no longer have the option of using a row cache, even if it makes
> great sense for the small-row CF data.
>
> Knowledge of data and access patterns gives you a very good advantage when
> it comes to caching your data effectively.
>
> --
> Tyler Hobbs
> Software Engineer, DataStax
> Maintainer of the pycassa Cassandra Python client library
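
The two positions in this thread can be compared with a toy simulation. The sketch below is not Cassandra's row cache: it assumes a plain LRU policy, a hot user CF of 100 rows read round-robin, and a batch CF whose rows are each read exactly once. The class LRUCache, the function user_hit_rate, and all sizes and rates are hypothetical names and numbers chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy row cache with least-recently-used eviction (an assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def read(self, key):
        """Return True on a hit. On a miss, load the row and evict if over capacity."""
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return True
        self.rows[key] = True
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # drop the least recently used row
        return False


def user_hit_rate(user_cache, batch_cache, batch_reads_per_user_read,
                  user_reads=20000, hot_rows=100):
    """Interleave user reads with a batch scan and return the user CF hit rate.

    Pass the same cache object twice to model one combined CF with one cache,
    or two separate objects to model per-CF caches.
    """
    hits = 0
    next_batch_row = 0
    for i in range(user_reads):
        if user_cache.read(("users", i % hot_rows)):
            hits += 1
        for _ in range(batch_reads_per_user_read):
            batch_cache.read(("batch", next_batch_row))  # each batch row is read once
            next_batch_row += 1
    return hits / user_reads


# One shared cache of 200 rows, slow batch scan (1 batch read per user read).
shared = LRUCache(200)
slow_scan_shared = user_hit_rate(shared, shared, batch_reads_per_user_read=1)

# Same shared cache size, faster batch scan (2 batch reads per user read).
shared = LRUCache(200)
fast_scan_shared = user_hit_rate(shared, shared, batch_reads_per_user_read=2)

# Same total memory split by hand: 150 rows for users, 50 for the batch CF.
fast_scan_split = user_hit_rate(LRUCache(150), LRUCache(50),
                                batch_reads_per_user_read=2)

print(slow_scan_shared, fast_scan_shared, fast_scan_split)
```

Under these assumptions the shared cache keeps the hot rows only while the scan is slow enough that fewer than 200 distinct rows are touched between two reads of the same user row. Once the scan outpaces that, every user read misses, while the hand-partitioned caches are unaffected by scan speed. Real access patterns are less regular than round-robin, so the numbers are illustrative only.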