Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: neutral (athena.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <AANLkTi=T4zU1xQOpqkNZ=mAm_kfh8rCU_mwKCaFZveYR@mail.gmail.com>
References: <AANLkTi=T4zU1xQOpqkNZ=mAm_kfh8rCU_mwKCaFZveYR@mail.gmail.com>
Date: Sun, 9 Jan 2011 12:54:40 -0600
Message-ID: <AANLkTin4DrKQpca00=Z+stbFGAZ-rybhoew-bozWP+sU@mail.gmail.com>
Subject: Re: A few quick questions to help me design a better schema..
From: Tyler Hobbs <tyler@riptano.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016e6dd97b1427f4304996e62c9

--0016e6dd97b1427f4304996e62c9
Content-Type: text/plain; charset=ISO-8859-1

>
> 1. ) If certain columns in a row are mutated very frequently, or if new
> columns are added to the row frequently, are reads of old columns that
> rarely change also affected? In other words, is the read performance of
> infrequently changing columns affected in any way when other columns in
> the same row are frequently updated or inserted?
>

Yes, the performance of reading columns that you haven't changed will still
be affected by changing other columns in the row.  Constantly updating a row
causes it to be split across multiple SSTables.  If you are asking for the
columns by name, you may not need to actually read any extra data from most
of the SSTables, but you will need to at least read the per-row Bloom Filter
on each one (for slices, you instead read the index and scan a portion of
the row); either way, this costs one seek for each SSTable.
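
To make that concrete, here is a toy Python model of the idea.  It is only
a sketch and not Cassandra's actual read path: plain dicts stand in for
SSTables, a membership test stands in for the Bloom Filter, and the names
flush() and read_column() are made up for illustration.

```python
# Toy model only: dicts stand in for SSTables, and a plain membership test
# stands in for the per-row Bloom Filter (a real one can return false
# positives, and real columns are reconciled by timestamp, not by order).

def flush(memtable):
    """Freeze a batch of in-memory writes into an immutable 'SSTable'."""
    return {row: dict(cols) for row, cols in memtable.items()}

def read_column(sstables, row_key, column):
    """Return (value, sstables_checked) for a read of one column by name."""
    checked = 0
    value = None
    for table in sstables:                # ordered oldest to newest
        if row_key not in table:
            continue
        checked += 1                      # filter/index lookup: about one seek
        if column in table[row_key]:      # "filter" says the column is here
            value = table[row_key][column]  # newer SSTables win
    return value, checked

# 'name' is written once; 'last_seen' is updated across four later flushes.
sstables = [flush({"user1": {"name": "alice"}})]
for i in range(4):
    sstables.append(flush({"user1": {"last_seen": i}}))

value, checked = read_column(sstables, "user1", "name")
print(value, checked)
```

Even though "name" was never touched again, the read still has to check
every SSTable that holds a piece of the row.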


> 2. ) Are all columns inside a super column family super columns, or can
> it hold a mix of simple columns and super columns as well?
>

They are all super columns.  There is no mixing of column types.


> 3. ) When the row cache is enabled and certain columns of a row are read,
> will the entire row be put into the cache, or just the columns that were
> read?
>

The entire row will be put into the cache.  This is good motivation for
splitting timelines into multiple rows, each covering a relatively short
timespan, if you mainly read the very end of the timeline.  Note that there
has been discussion somewhere of allowing you to only cache the last N
columns of a row in the row cache.
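
One way to do the splitting is to put a time bucket in the row key.  A
minimal Python sketch, assuming one row per user per day; the key format,
the timeline_row_key() helper and the one-day bucket width are all made up
for illustration, so pick whatever fits your read pattern:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 24 * 60 * 60  # assumed bucket width: one day

def timeline_row_key(user_id, when, bucket_seconds=BUCKET_SECONDS):
    """Build a row key of the form '<user_id>:<bucket number>'."""
    epoch_seconds = int(when.timestamp())
    bucket = epoch_seconds // bucket_seconds  # whole buckets since the epoch
    return f"{user_id}:{bucket}"

morning = datetime(2011, 1, 9, 8, 0, tzinfo=timezone.utc)
evening = datetime(2011, 1, 9, 20, 0, tzinfo=timezone.utc)
next_day = datetime(2011, 1, 10, 8, 0, tzinfo=timezone.utc)

# Same day: same row.  Next day: a new row.
print(timeline_row_key("alice", morning))
print(timeline_row_key("alice", evening))
print(timeline_row_key("alice", next_day))
```

Reading the end of the timeline then only touches (and caches) the most
recent row or two instead of the whole history.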


> 4. ) Does a larger number of column families have any impact on
> performance (I read about it somewhere)? Should the information for a
> particular row key be split across multiple column families according to
> the specific query demands, or should all data related to a particular
> row key be kept together in a single column family?
>

A higher number of column families uses more memory and causes more
compactions, since each column family has its own memtables and SSTables.
I can't answer the rest of the question accurately without more detail on
the particular use case.


> 5. ) Are there any limitations of valueless columns to consider? I read in
> a ppt "Only works with <= 2B columns in 0.7 valueless colum". I couldn't
> understand the meaning of this statement.
>

I believe this is referring to the 2 billion column limit per row.  In the
real world, you generally don't want to get anywhere near that many columns
in a single row.

- Tyler

--0016e6dd97b1427f4304996e62c9--