Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
Sender: david@daotown.com
In-Reply-To: <BANLkTimY=L2+pK5gdZ0SEhvsbuGtfAYWOw@mail.gmail.com>
References: <BANLkTimDvu=z3Wh00b-LxYNU+qXu49kexA@mail.gmail.com>
	<BANLkTi==cwCZ95gXzy7uZ+p59vm8sC3tzA@mail.gmail.com>
	<BANLkTinqGPvzeKyX5EpMvGW5gHDjxvy4mQ@mail.gmail.com>
	<BANLkTinFQfACgC-WgCnU3jjuCKqBZhBQBg@mail.gmail.com>
	<BANLkTik=xDyk3TB1-Bkmy-t0qLFSCB3LVQ@mail.gmail.com>
	<BANLkTinMNBXeEzonws2FWVNeWXX9odAcUA@mail.gmail.com>
	<BANLkTima1ESv_MSxokgWBe-XkueSQjO6PQ@mail.gmail.com>
	<BANLkTimaSEWyhthzAuHrenvqAn4tp-qp7w@mail.gmail.com>
	<BANLkTi=ip0Qd1xnjN1Sn_SF5uA=dJ3p6WQ@mail.gmail.com>
	<BANLkTimY=L2+pK5gdZ0SEhvsbuGtfAYWOw@mail.gmail.com>
Date: Mon, 2 May 2011 13:05:19 +0300
Message-ID: <BANLkTi=Ae4Be1mNmYUgoEOmj0E39fzmBtw@mail.gmail.com>
Subject: Re: Combining all CFs into one big one
From: David Boxenhorn <david@taotown.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0023543a283c306da804a24829f1

--0023543a283c306da804a24829f1
Content-Type: text/plain; charset=ISO-8859-1

Wouldn't the once-used rows from your batch process be evicted from the
cache quickly and replaced by frequently-used rows? That should hold even
if the batch process runs for a long time, since caching is done on a
row-by-row basis. In effect, part of your cache would be taken up by the
batch process, much as if you had dedicated a permanent cache to the
batch, except that it isn't permanent, so it's better!
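
One way to check this intuition is a toy simulation. The sketch below
assumes a plain LRU eviction policy and a single shared cache; it is not
Cassandra's row cache, and the names (LRUCache, hot_hit_rate) are made up
for illustration. It interleaves reads of a small set of "hot" rows with
reads of batch rows that are each touched only once, and reports how often
the hot rows are found in the cache.

```python
from collections import OrderedDict


class LRUCache:
    """Toy LRU cache that tracks keys only (hypothetical, not Cassandra's)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = OrderedDict()

    def read(self, key):
        """Return True on a cache hit; on a miss, load the row and evict."""
        if key in self.rows:
            self.rows.move_to_end(key)      # mark as most recently used
            return True
        self.rows[key] = True               # simulate loading the row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)   # evict the least recently used
        return False


def hot_hit_rate(capacity, n_hot, batch_reads_per_hot_read, steps):
    """Hit rate of the hot rows when batch reads share the same cache.

    Each step reads one hot row (round-robin over n_hot rows) and then
    batch_reads_per_hot_read batch rows that are never read again.
    """
    cache = LRUCache(capacity)
    hot_hits = 0
    next_batch_key = 0
    for step in range(steps):
        if cache.read(("hot", step % n_hot)):
            hot_hits += 1
        for _ in range(batch_reads_per_hot_read):
            cache.read(("batch", next_batch_key))
            next_batch_key += 1
    return hot_hits / steps


if __name__ == "__main__":
    for ratio in (0, 1, 2):
        rate = hot_hit_rate(capacity=100, n_hot=50,
                            batch_reads_per_hot_read=ratio, steps=5000)
        print(f"{ratio} batch reads per hot read: hot hit rate {rate:.2f}")
```

With these made-up numbers, the hot rows stay cached when the batch reads
one row per hot read, and are pushed out entirely when it reads two. So
under LRU the answer seems to depend on how fast the batch scans relative
to how often the hot rows are re-read.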


On Mon, May 2, 2011 at 7:50 AM, Tyler Hobbs <tyler@datastax.com> wrote:

>> If you had one big cache, wouldn't it be the case that it's mostly
>> populated with frequently accessed rows, and less populated with rarely
>> accessed rows?
>>
>
> Yes.
>
>> In fact, wouldn't one big cache dynamically and automatically give you
>> exactly what you want? If you try to partition the same amount of memory
>> manually, by guesswork, among many tables, aren't you always going to do a
>> worse job?
>>
>
> Suppose you have one CF that's used constantly through interaction by
> users.  Suppose you have another CF that's only used periodically by a
> batch process, which tends to access most or all of the rows on each
> run, and that CF is too large to cache in full.  Normally, you would
> dedicate cache space to the first CF as anything with human interaction
> tends to have good temporal locality and you want to keep latencies there
> low.  On the other hand, caching the second CF provides little to no real
> benefit.  When you combine these two CFs, every time your batch process
> runs, rows from the second CF will populate the cache and will cause
> eviction of rows from the first CF, even though having those rows in the
> cache provides little benefit to you.
>
> As another example, if you mix a CF with wide rows and a CF with small
> rows, you no longer have the option of using a row cache, even if it makes
> great sense for the small-row CF data.
>
> Knowledge of data and access patterns gives you a very good advantage when
> it comes to caching your data effectively.
>
>
> --
> Tyler Hobbs
> Software Engineer, DataStax <http://datastax.com/>
> Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
> Python client library
>
>

--0023543a283c306da804a24829f1--