Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of shimi.k@gmail.com designates
 209.85.210.44 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type;
        b=txuR0YKyup6w7T6bLwW/Vq5ghOpmlPLj31W2H3Lcp3f5Y4ssnfNbRR+Nmuesn5CuxS
         GcmECmQ0oFyjP7UR3xYfgUn9O753UWtyBNqq4QTRMFoqBBE8yL92enptxkJXKSCuZ9Er
         a/iTely+XvIZ0gO0BUns0hP/476rHngtY+uGM=
MIME-Version: 1.0
In-Reply-To: <BANLkTik=xDyk3TB1-Bkmy-t0qLFSCB3LVQ@mail.gmail.com>
References: <BANLkTimDvu=z3Wh00b-LxYNU+qXu49kexA@mail.gmail.com>
	<BANLkTi==cwCZ95gXzy7uZ+p59vm8sC3tzA@mail.gmail.com>
	<BANLkTinqGPvzeKyX5EpMvGW5gHDjxvy4mQ@mail.gmail.com>
	<BANLkTinFQfACgC-WgCnU3jjuCKqBZhBQBg@mail.gmail.com>
	<BANLkTik=xDyk3TB1-Bkmy-t0qLFSCB3LVQ@mail.gmail.com>
Date: Sun, 1 May 2011 21:58:57 +0300
Message-ID: <BANLkTinMNBXeEzonws2FWVNeWXX9odAcUA@mail.gmail.com>
Subject: Re: Combining all CFs into one big one
From: shimi <shimi.k@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001636e909e3c24f0c04a23b7f52

--001636e909e3c24f0c04a23b7f52
Content-Type: text/plain; charset=ISO-8859-1

On Sun, May 1, 2011 at 9:48 PM, Jake Luciani <jakers@gmail.com> wrote:

> If you have N column families you need N * memtable size of RAM to support
> this.  If that's not an option you can merge them into one as you suggest
> but then you will have much larger SSTables, slower compactions, etc.
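
To put a rough number on the N * memtable size estimate above, here is a
minimal sketch. The 64 MB memtable size is only an assumed figure for
illustration (the ~40 CFs comes from the original question), and the
function name is made up:

```python
def memtable_ram_mb(num_cfs, memtable_mb):
    """Rough RAM needed for memtables: one memtable per column family."""
    return num_cfs * memtable_mb


# ~40 CFs as in the original question; 64 MB per memtable is an assumption.
separate_cfs_mb = memtable_ram_mb(40, 64)
merged_cf_mb = memtable_ram_mb(1, 64)
print(separate_cfs_mb, merged_cf_mb)
```

This ignores per-memtable overhead and any flush queueing, so treat it as
a lower bound rather than a sizing rule.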


> I don't necessarily agree with Tyler that the OS cache will be less
> effective... But I do agree that if the sizes of sstables are too large for
> you then more hardware is the solution...


If you merge CFs that are rarely accessed with ones that are accessed
frequently, then whenever you read the SSTable you also load rarely
accessed data into the OS cache.
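
A toy model of that effect, as a rough sketch only. It assumes hot and
cold rows are uniformly interleaved in the merged SSTable and that rows
are much smaller than a cache page, so every cached page carries the same
hot/cold mix. The function names and the sizes are hypothetical:

```python
def hot_fraction_per_page(hot_gb, cold_gb):
    """Share of each cached page holding hot data, assuming hot and cold
    rows are uniformly interleaved and much smaller than a page."""
    return hot_gb / (hot_gb + cold_gb)


def effective_hot_cache_mb(cache_mb, hot_gb, cold_gb):
    """Portion of the OS cache that ends up holding hot data under the
    same uniform-interleaving assumption."""
    return cache_mb * hot_fraction_per_page(hot_gb, cold_gb)


# Hypothetical sizes: 4 GB of OS cache, 10 GB hot data, 30 GB cold data.
print(effective_hot_cache_mb(4096, 10, 30))
```

With separate CFs the cold data would only enter the cache when it is
actually read, so the real difference depends on row sizes and access
patterns that this model leaves out.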

Another thing to be aware of: if you need to run any of the nodetool CF
tasks and you only need it for one specific CF, running it on just that
CF is better and faster.
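
As an illustration, here is a small sketch that builds the nodetool
command line for either a whole keyspace or a single CF. The keyspace and
CF names are made up, and the helper only builds the argument list without
running anything:

```python
def nodetool_args(host, task, keyspace, column_family=None):
    """Build a nodetool argument list; with column_family set, the task
    is limited to that CF instead of the whole keyspace."""
    args = ["nodetool", "-h", host, task, keyspace]
    if column_family is not None:
        args.append(column_family)
    return args


# Hypothetical keyspace and CF names.
whole_keyspace = nodetool_args("localhost", "compact", "MyKeyspace")
single_cf = nodetool_args("localhost", "compact", "MyKeyspace", "Users")
print(" ".join(single_cf))
```

Once everything lives in one big CF, only the whole-keyspace form is left,
so tasks such as compact, flush and repair always touch all of the data.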

Shimi


>
>
> On Sun, May 1, 2011 at 1:24 PM, Tyler Hobbs <tyler@datastax.com> wrote:
>
>> When you have a high number of CFs, it's a good idea to consider merging
>> CFs with highly correlated access patterns and similar structure into one.
>> It is *not* a good idea to merge all of your CFs into one (unless they all
>> happen to meet these criteria). Here's why:
>>
>> Besides big compactions and long repairs that you can't break down into
>> smaller pieces, the main problem here is that your caching will become much
>> less efficient. The OS buffer cache will be less effective because rows from
>> all of the CFs will be interspersed in the SSTables. You will no longer be
>> able to tune the key or row cache to only cache frequently accessed data.
>> Both of these will tend to cause a serious increase in latency for your hot
>> data.
>>
>>> Shouldn't these kinds of problems be solved by Cassandra?
>>>
>> They are mainly solved by Cassandra's general solution to any performance
>> problem: the addition of more nodes. There are tickets open to improve
>> compaction strategies, put bounds on SSTable sizes, etc; for example,
>> https://issues.apache.org/jira/browse/CASSANDRA-1608 , but the addition
>> of more nodes is a reliable solution to problems of this nature.
>>
>> On Sun, May 1, 2011 at 7:28 AM, David Boxenhorn <david@taotown.com> wrote:
>>
>>> Shouldn't these kinds of problems be solved by Cassandra? Isn't there a
>>> maximum SSTable size?
>>>
>>> On Sun, May 1, 2011 at 3:24 PM, shimi <shimi.k@gmail.com> wrote:
>>>
>>>> Big sstables and long compactions. In a major compaction you will need
>>>> free disk space equal to the size of all the sstables (which you should
>>>> have anyway).
>>>>
>>>> Shimi
>>>>
>>>>
>>>> On Sun, May 1, 2011 at 2:03 PM, David Boxenhorn <david@taotown.com> wrote:
>>>>
>>>>> I'm having problems administering my cluster because I have too many
>>>>> CFs (~40).
>>>>>
>>>>> I'm thinking of combining them all into one big CF. I would prefix the
>>>>> current CF name to the keys, repeat the CF name in a column, and index the
>>>>> column (so I can loop over all rows, which I have to do sometimes, for some
>>>>> CFs).
>>>>>
>>>>> Can anyone think of any disadvantages to this approach?
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> Tyler Hobbs
>> Software Engineer, DataStax <http://datastax.com/>
>> Maintainer of the pycassa <http://github.com/pycassa/pycassa> Cassandra
>> Python client library
>>
>>
>
>
> --
> http://twitter.com/tjake
>

--001636e909e3c24f0c04a23b7f52--