Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <y2kac68ab851004280531ybc71e66fr7b6fba42083a425d@mail.gmail.com>
References: <l2g548e33c91004280456ya03a1ca9vca6a5cb8ecf34a1c@mail.gmail.com>
	 <y2kac68ab851004280531ybc71e66fr7b6fba42083a425d@mail.gmail.com>
Date: Wed, 28 Apr 2010 16:11:21 +0300
Message-ID: <p2l548e33c91004280611ud264ead8o1b549f8774afe6f3@mail.gmail.com>
Subject: Re: question about how columns are deserialized in memory
From: Даниел Симеонов <dsimeonov@gmail.com>
To: user@cassandra.apache.org, sylvain@yakaz.com
Content-Type: multipart/alternative; boundary=0016e64718c2144eea04854bbf1f

--0016e64718c2144eea04854bbf1f
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Hi Sylvain,
  Thank you very much! I still have a couple of questions. I could not find
how the row cache is configured. As for splitting rows, I understand that it
is not really necessary, but I am still curious whether it can be implemented
in client code.
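
To make the client-side question concrete, here is a minimal sketch of what I
have in mind, assuming the split is driven by time rather than by a column
count. The names bucketed_row_key, column_name and "sensor42" are hypothetical,
and no Cassandra client call is shown. Because the row key is a pure function
of the timestamp, concurrent writers would pick the same row without any
coordination.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucketed_row_key(source_id: str, ts: datetime) -> str:
    """Build a row key of the form <id>.<year/month/day>.<hour> (assumed layout).

    The timestamp must be timezone-aware; it is normalised to UTC so that
    every writer maps the same instant to the same row.
    """
    ts = ts.astimezone(timezone.utc)
    return f"{source_id}.{ts:%Y/%m/%d}.{ts:%H}"


def column_name(ts: datetime) -> int:
    """Column name as whole microseconds since the epoch, so columns sort by time."""
    return (ts - EPOCH) // timedelta(microseconds=1)


if __name__ == "__main__":
    # 16:11:21 at UTC+3 falls into the 13:00 UTC bucket.
    ts = datetime(2010, 4, 28, 16, 11, 21, tzinfo=timezone(timedelta(hours=3)))
    print(bucketed_row_key("sensor42", ts), column_name(ts))
```

A threshold such as "100 columns per row" would need a shared counter, which is
exactly what gets awkward with concurrent inserts; a time bucket avoids that.
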
Best regards, Daniel.

2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>

> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> > Hi,
> >    I have a question: if a row in a Column Family has only columns,
> > are all of the columns deserialized in memory if you need any of
> > them? As I understood it, that is the case,
>
> No, it's not. Only the columns you request are deserialized in memory. The
> only caveat is that, as of now, the entire row is deserialized at once
> during compaction, so it still has to fit in memory. But depending on the
> typical size of your columns, you can easily have millions of columns in a
> row without it being a problem at all.
>
> >  and if the Column Family is a super
> > Column Family, then only the (entire) Super Column is brought into
> > memory?
>
> Yes, that part is true. That is the problem with the current
> implementation of super columns. While you can have lots of columns in one
> row, you probably don't want to have lots of columns in one super column
> (but it's no problem to have lots of super columns in one row).
>
> > What about row cache, is it different than memtable?
>
> Be careful with the row cache. If the row cache is enabled, then yes, any
> read in a row will read the entire row. So you typically don't want to use
> the row cache in a column family whose rows have lots of columns (unless you
> always read all the columns in the row each time, of course).
>
> > I have another question. Let's say data is only ever inserted, and the
> > solution is to add columns to rows in a Column Family. Is it possible
> > for Cassandra to split the row once a certain threshold is reached,
> > say 100 columns per row? What if there are concurrent inserts?
>
> No, Cassandra can't do that for you. But you should be okay with what you
> describe below. That is, if a given row corresponds to an hour of data, that
> will limit its size. And again, the number of columns in a row is not really
> limited as long as the overall size of the row fits easily in memory.
>
> > The original data model and use case is to insert timestamped data and to
> > make range queries. The original keys of the CF rows had the form
> > <id>.<timestamp>, each with a single column of data, and OPP was used.
> > This is not an optimal solution, since some nodes are hotter than others.
> > I am thinking of changing the model to have keys like
> > <id>.<year/month/day> and then a list of columns with timestamps within
> > this range, with RandomPartitioner, or to use OPP but preprocess part of
> > the key with MD5, i.e. the key is MD5(<id>.<year/month/day>) + "hour of
> > the day". The only problem is how to deal with a large number of columns
> > being inserted in a particular row.
> > Thank you very much!
> > Best regards, Daniel.
>
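
For reference, the MD5-prefixed key from my original message could be computed
roughly as below. This is only a sketch under my own assumptions: the function
name opp_row_key, the hex encoding of the digest and the two-digit hour suffix
are illustrative choices, not anything Cassandra prescribes.

```python
import hashlib


def opp_row_key(source_id: str, day: str, hour: int) -> str:
    """Return MD5(<id>.<year/month/day>) as hex, followed by a two-digit hour.

    Under an order-preserving partitioner the hashed prefix scatters different
    days (and ids) around the ring, while the 24 hourly rows of one day share
    the prefix and therefore sort next to each other.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in the range 0..23")
    prefix = hashlib.md5(f"{source_id}.{day}".encode("utf-8")).hexdigest()
    return f"{prefix}{hour:02d}"


if __name__ == "__main__":
    print(opp_row_key("sensor42", "2010/04/28", 13))
```

A range query over one day would then scan from the key for hour 0 to the key
for hour 23 of that day's prefix.
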


--0016e64718c2144eea04854bbf1f--