Mailing-List: user@cassandra.apache.org
Date: Wed, 28 Apr 2010 18:36:05 +0300
Subject: Re: question about how columns are deserialized in memory
From: Даниел Симеонов (Daniel Simeonov) <dsimeonov@gmail.com>
To: Sylvain Lebresne <sylvain@yakaz.com>, user@cassandra.apache.org

Hi,

   What about if the upper bound of columns in a row is only loosely
defined, i.e. it is OK to have a maximum of around 100, for example, but
not exactly (maybe 105 or 110)? And if I make a slice query to return,
say, 1/5th of the columns in a row, am I right that such a query again
will not deserialize all columns in memory?

Best regards, Daniel.

2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>:
> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> > Hi Sylvain,
> >    Thank you very much! I still have some further questions; I didn't
> > find how the row cache is configured?
>
> Provided you don't use trunk but something stable like 0.6.1 (which you
> should), it is in storage-conf.xml. It is one of the options in the
> definition of the column families (it is documented in the file).
>
> > Regarding the splitting of rows, I understand that it is not really
> > necessary, but I am still curious whether it can be implemented in
> > client code.
>
> Well, I'm not sure there is any simple way to do it (at least not
> efficiently).
> Counting the number of columns in a row is expensive, plus there is no
> easy way to implement a counter in Cassandra (even though
> https://issues.apache.org/jira/browse/CASSANDRA-580 will make that
> better someday).
>
> > Best regards, Daniel.
> >
> > 2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>
> >>
> >> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> >> > Hi,
> >> >    I have a question: if a row in a Column Family has only columns,
> >> > are all of the columns deserialized in memory when you need any of
> >> > them? As I understood it, that is the case,
> >>
> >> No, it's not. Only the columns you request are deserialized in
> >> memory. The only thing is that, as of now, during compaction the
> >> entire row will be deserialized at once, so it still has to fit in
> >> memory. But depending on the typical size of your columns, you can
> >> easily have millions of columns in a row without it being a problem
> >> at all.
> >>
> >> > and if the Column Family is a super Column Family, then only the
> >> > (entire) Super Column is brought up in memory?
> >>
> >> Yes, that part is true. That is the problem with the current
> >> implementation of super columns. While you can have lots of columns
> >> in one row, you probably don't want to have lots of columns in one
> >> super column (but it's no problem to have lots of super columns in
> >> one row).
> >>
> >> > What about the row cache, is it different from the memtable?
> >>
> >> Be careful with the row cache. If the row cache is enabled, then yes,
> >> any read in a row will read the entire row. So you typically don't
> >> want to use the row cache on a column family whose rows have lots of
> >> columns (unless you always read all the columns in the row each time,
> >> of course).
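The row cache option Sylvain points at can be made concrete. A sketch from memory of the 0.6-era storage-conf.xml column family definition; the column family name, comparator, and values are made up for illustration:

```xml
<!-- Illustrative ColumnFamily fragment for a 0.6-era storage-conf.xml.
     RowsCached enables the row cache discussed above; leaving it at 0
     avoids whole-row reads on column families with very wide rows. -->
<ColumnFamily Name="Events"
              CompareWith="TimeUUIDType"
              KeysCached="10000"
              RowsCached="0"/>
```

Per the thread, a non-zero RowsCached only makes sense when whole rows are small enough, and are usually read in full, anyway.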
> >>
> >> > I have one more question: let's say there is only data to be
> >> > inserted, and a solution is to have columns added to rows in a
> >> > Column Family. Is it possible in Cassandra to split the row if a
> >> > certain threshold is reached, say 100 columns per row? And what if
> >> > there are concurrent inserts?
> >>
> >> No, Cassandra can't do that for you. But you should be okay with what
> >> you describe below. That is, if a given row corresponds to an hour of
> >> data, that will limit its size. And again, the number of columns in a
> >> row is not really limited as long as the overall size of the row fits
> >> easily in memory.
> >>
> >> > The original data model and use case is to insert timestamped data
> >> > and to make range queries. The original keys of the CF rows were of
> >> > the form <id>.<timestamp>, each with a single column holding the
> >> > data, and OPP was used. This is not an optimal solution, since some
> >> > nodes end up hotter than others. I am thinking of changing the
> >> > model to keys like <id>.<year/month/day> with a list of columns
> >> > with timestamps within that range, using RandomPartitioner, or
> >> > keeping OPP but preprocessing part of the key with MD5, i.e. the
> >> > key is MD5(<id>.<year/month/day>) + "hour of the day". The only
> >> > problem is how to deal with the large number of columns being
> >> > inserted in a particular row.
> >> > Thank you very much!
> >> > Best regards, Daniel.
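The key scheme Daniel describes at the end can be sketched in a few lines. This is purely illustrative: the function name, entity id, and formatting are made up, not from the thread; only the idea (MD5 over the id+day so keys spread evenly under OPP, with the hour appended as a bucket suffix) comes from the message above.

```python
import hashlib
from datetime import datetime

def row_key(entity_id, ts):
    """Build MD5(<id>.<year/month/day>) + hour-of-day, per the scheme above.

    The MD5 prefix randomizes placement under an order-preserving
    partitioner; the hour suffix keeps one hour of data per row.
    """
    day = ts.strftime("%Y/%m/%d")
    digest = hashlib.md5(("%s.%s" % (entity_id, day)).encode("utf-8")).hexdigest()
    return "%s.%02d" % (digest, ts.hour)

k1 = row_key("sensor-42", datetime(2010, 4, 28, 18, 36))
k2 = row_key("sensor-42", datetime(2010, 4, 28, 19, 5))
# Same id and day -> same MD5 prefix, different hour suffix.
print(k1[:32] == k2[:32], k1[-2:], k2[-2:])  # True 18 19
```

Note the trade-off this sketch inherits from the thread: range queries within one id+day stay cheap (hour rows are adjacent in name only via the suffix), but cross-day scans now require computing each day's hash.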
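Returning to Daniel's opening question about slicing 1/5th of a row: because columns within a row are stored sorted by name, a slice query can bound its result without materializing the rest of the row. A toy in-memory model of that behavior (purely illustrative, not Cassandra code; the function and names are invented):

```python
# Toy model: a row is a list of (name, value) pairs sorted by column
# name, standing in for the sorted column index of one Cassandra row.
import bisect

def slice_columns(row, start, finish, count=100):
    """Return at most `count` columns with start <= name <= finish."""
    names = [name for name, _ in row]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    # Only the requested contiguous range is touched, not the whole row.
    return row[lo:hi][:count]

row = sorted(("col%03d" % i, i) for i in range(500))
# Ask for 1/5th of a 500-column row.
fifth = slice_columns(row, "col000", "col099", count=100)
print(len(fifth))  # 100
```

The real server does the bounding against its on-disk column index rather than a Python list, but the consequence matches Sylvain's answer: a slice of 100 columns costs roughly the same whether the row holds 500 columns or millions (compaction aside).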