From: Даниел Симеонов <dsimeonov@gmail.com>
To: user@cassandra.apache.org, sylvain@yakaz.com
Date: Wed, 28 Apr 2010 16:11:21 +0300
Subject: Re: question about how columns are deserialized in memory

Hi Sylvain,
Thank you very much! I still have some further questions. I couldn't find how the row cache is configured. Regarding the splitting of rows, I understand that it is not really necessary, but I am still curious whether it can be implemented in client code.
Best regards, Daniel.

2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>
2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> Hi,
> I have a question: if a row in a Column Family has only columns,
> are all of the columns deserialized into memory if you need any of
> them? As I understood, that is the case,

No, it's not. Only the columns you request are deserialized in memory. The
only caveat is that, as of now, the entire row is deserialized at once
during compaction, so it still has to fit in memory. But depending on the
typical size of your columns, you can easily have millions of columns in a
row without it being a problem at all.
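
To make the "fits in memory" point concrete, here is a back-of-the-envelope sketch in Python. The function name, the example workload, and the 15-byte per-column overhead are assumptions for illustration, not figures from this thread, so measure your own column sizes before relying on it.

```python
def estimated_row_bytes(num_columns, name_bytes, value_bytes, overhead_bytes=15):
    """Rough serialized size of a row.

    Each column is counted as name + value + a fixed per-column overhead
    (assumed here; covers things like lengths and the timestamp).
    """
    return num_columns * (name_bytes + value_bytes + overhead_bytes)


# Hypothetical workload: 1 million columns, 8-byte names, 50-byte values.
size = estimated_row_bytes(1_000_000, name_bytes=8, value_bytes=50)
print(f"{size / 2**20:.1f} MiB")
```

Under these assumed sizes, a row with a million small columns stays in the tens of megabytes, which is the amount that would have to be held in memory during compaction.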

> and if the Column Family is a super
> Column Family, then only the (entire) Super Column is brought into memory?

Yes, that part is true. That is the problem with the current
implementation of super columns. While you can have lots of columns in one
row, you probably don't want to have lots of columns in one super column
(but it's no problem to have lots of super columns in one row).

> What about the row cache? Is it different from the memtable?

Be careful with the row cache. If the row cache is enabled, then yes, any
read in a row will read the entire row. So you typically don't want to use
the row cache in column families where rows have lots of columns (unless
you always read all the columns in the row each time, of course).

> I have one more question. Let's say data is only ever inserted, and
> the solution is to add columns to rows in a Column Family. Is
> it possible in Cassandra to split the row once a certain threshold is reached,
> say 100 columns per row? And what if there are concurrent inserts?

No, Cassandra can't do that for you. But you should be okay with what you
describe below. That is, if a given row corresponds to an hour of data,
that will limit its size. And again, the number of columns in a row is not
really limited as long as the overall size of the row fits easily in memory.
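
A rough Python sketch of the hour-per-row layout, as it might be done in client code. The function name `bucket_for`, the source id `sensor42`, and the exact key and column-name formats are hypothetical choices for illustration; no Cassandra client API is used or implied.

```python
from datetime import datetime, timezone


def bucket_for(source_id, ts):
    """Return (row_key, column_name) for one timestamped sample.

    The row key covers one hour of data for one source, which bounds the
    row size. The column name is the full timestamp, so columns sort
    chronologically within the row (assuming a byte/ASCII comparator).
    """
    row_key = f"{source_id}.{ts:%Y/%m/%d}.{ts:%H}"
    column_name = ts.strftime("%Y%m%dT%H%M%S.%f")
    return row_key, column_name


sample_time = datetime(2010, 4, 28, 13, 11, 21, tzinfo=timezone.utc)
row_key, column_name = bucket_for("sensor42", sample_time)
print(row_key, column_name)
```

Because the row key is a pure function of the source id and the timestamp, concurrent writers arrive at the same row without coordinating, which sidesteps the question of splitting a row under concurrent inserts.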

> The original data model and use case is to insert timestamped data and to
> make range queries. The original keys of CF rows were of the form
> <id>.<timestamp>, each with a single column of data, and OPP was used. This is
> not an optimal solution, since some nodes are hotter than others. I am thinking
> of changing the model to have keys like <id>.<year/month/day> and
> then a list of columns with timestamps within this range, with either
> RandomPartitioner, or OPP with part of the key preprocessed with MD5, i.e.
> the key is MD5(<id>.<year/month/day>) + "hour of the day". The only problem
> is how to deal with a large number of columns being inserted into a particular
> row.
> Thank you very much!
> Best regards, Daniel.
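
The MD5(<id>.<year/month/day>) + "hour of the day" key described above could be built along these lines. This is only a sketch: the function names, the hex encoding of the digest, and the two-digit hour suffix are assumptions, not something specified in the thread.

```python
import hashlib


def opp_row_key(source_id, day, hour):
    """Build MD5(<id>.<year/month/day>) + hour-of-day as a string key.

    The hashed prefix scatters different ids and days across the key
    space, while the zero-padded hour suffix keeps the hours of one day
    adjacent in key order.
    """
    prefix = hashlib.md5(f"{source_id}.{day}".encode("utf-8")).hexdigest()
    return f"{prefix}{hour:02d}"


def day_row_keys(source_id, day, first_hour, last_hour):
    """Row keys to read for an hour range within a single day."""
    return [opp_row_key(source_id, day, h) for h in range(first_hour, last_hour + 1)]


# Hypothetical id and day; the keys for 09:00 through 13:00 share one prefix.
for key in day_row_keys("sensor42", "2010/04/28", 9, 13):
    print(key)
```

With an order-preserving partitioner, the keys for one id and day would then form a contiguous range, so a range query within a day stays possible, while different days and ids land at unrelated points in the key space.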
