Mailing-List: user@cassandra.apache.org
Date: Wed, 28 Apr 2010 18:36:05 +0300
Subject: Re: question about how columns are deserialized in memory
From: Даниел Симеонов (Daniel Simeonov) <dsimeonov@gmail.com>
To: Sylvain Lebresne <sylvain@yakaz.com>, user@cassandra.apache.org

Hi,

   What about if the upper bound of columns in a row is only loosely
defined, i.e. it is OK to have a maximum of around 100, for example, but
not exactly (maybe 105 or 110)? And if I make a slice query to return,
say, 1/5th of the columns in a row, am I right that such a query again
will not deserialize all columns in memory?

Best regards, Daniel.

2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>:
> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> > Hi Sylvain,
> >    Thank you very much! I still have some further questions; I didn't
> > find how the row cache is configured?
>
> Provided you don't use trunk but something stable like 0.6.1 (which you
> should), it is in storage-conf.xml. It is one of the options in the
> definition of the column families (it is documented in the file).
>
> > Regarding the splitting of rows, I understand that it is not really
> > necessary, but I am still curious whether it can be implemented in
> > client code.
>
> Well, I'm not sure there is any simple way to do it (at least not
> efficiently).
> Counting the number of columns in a row is expensive, plus there is no
> easy way to implement a counter in Cassandra (even though
> https://issues.apache.org/jira/browse/CASSANDRA-580 will make that
> better someday).
>
> > Best regards, Daniel.
> >
> > 2010/4/28 Sylvain Lebresne <sylvain@yakaz.com>
> >>
> >> 2010/4/28 Даниел Симеонов <dsimeonov@gmail.com>:
> >> > Hi,
> >> >    I have a question: if a row in a Column Family has only columns,
> >> > are all of the columns deserialized in memory when you need any of
> >> > them? As I understood it, that is the case,
> >>
> >> No, it's not. Only the columns you request are deserialized in
> >> memory. The only thing is that, as of now, during compaction the
> >> entire row will be deserialized at once, so it still has to fit in
> >> memory. But depending on the typical size of your columns, you can
> >> easily have millions of columns in a row without it being a problem
> >> at all.
> >>
> >> > and if the Column Family is a super Column Family, then only the
> >> > (entire) Super Column is brought up in memory?
> >>
> >> Yes, that part is true. That is the problem with the current
> >> implementation of super columns. While you can have lots of columns
> >> in one row, you probably don't want to have lots of columns in one
> >> super column (but it's no problem to have lots of super columns in
> >> one row).
> >>
> >> > What about the row cache, is it different from the memtable?
> >>
> >> Be careful with the row cache. If the row cache is enabled, then yes,
> >> any read in a row will read the entire row. So you typically don't
> >> want to use the row cache on a column family whose rows have lots of
> >> columns (unless you always read all the columns in the row each time,
> >> of course).
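The row cache option Sylvain points at can be made concrete. A sketch from memory of the 0.6-era storage-conf.xml column family definition; the column family name, comparator, and values are made up for illustration:

```xml
<!-- Illustrative ColumnFamily fragment for a 0.6-era storage-conf.xml.
     RowsCached enables the row cache discussed above; leaving it at 0
     avoids whole-row reads on column families with very wide rows. -->
<ColumnFamily Name="Events"
              CompareWith="TimeUUIDType"
              KeysCached="10000"
              RowsCached="0"/>
```

Per the thread, a non-zero RowsCached only makes sense when whole rows are small enough, and are usually read in full, anyway.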
> >>
> >> > I have one more question: let's say there is only data to be
> >> > inserted, and a solution is to have columns added to rows in a
> >> > Column Family. Is it possible in Cassandra to split the row if a
> >> > certain threshold is reached, say 100 columns per row? And what if
> >> > there are concurrent inserts?
> >>
> >> No, Cassandra can't do that for you. But you should be okay with what
> >> you describe below. That is, if a given row corresponds to an hour of
> >> data, that will limit its size. And again, the number of columns in a
> >> row is not really limited as long as the overall size of the row fits
> >> easily in memory.
> >>
> >> > The original data model and use case is to insert timestamped data
> >> > and to make range queries. The original keys of the CF rows were of
> >> > the form <id>.<timestamp>, each with a single column holding the
> >> > data, and OPP was used. This is not an optimal solution, since some
> >> > nodes end up hotter than others. I am thinking of changing the
> >> > model to keys like <id>.<year/month/day> with a list of columns
> >> > with timestamps within that range, using RandomPartitioner, or
> >> > keeping OPP but preprocessing part of the key with MD5, i.e. the
> >> > key is MD5(<id>.<year/month/day>) + "hour of the day". The only
> >> > problem is how to deal with the large number of columns being
> >> > inserted in a particular row.
> >> > Thank you very much!
> >> > Best regards, Daniel.
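The key scheme Daniel describes at the end can be sketched in a few lines. This is purely illustrative: the function name, entity id, and formatting are made up, not from the thread; only the idea (MD5 over the id+day so keys spread evenly under OPP, with the hour appended as a bucket suffix) comes from the message above.

```python
import hashlib
from datetime import datetime

def row_key(entity_id, ts):
    """Build MD5(<id>.<year/month/day>) + hour-of-day, per the scheme above.

    The MD5 prefix randomizes placement under an order-preserving
    partitioner; the hour suffix keeps one hour of data per row.
    """
    day = ts.strftime("%Y/%m/%d")
    digest = hashlib.md5(("%s.%s" % (entity_id, day)).encode("utf-8")).hexdigest()
    return "%s.%02d" % (digest, ts.hour)

k1 = row_key("sensor-42", datetime(2010, 4, 28, 18, 36))
k2 = row_key("sensor-42", datetime(2010, 4, 28, 19, 5))
# Same id and day -> same MD5 prefix, different hour suffix.
print(k1[:32] == k2[:32], k1[-2:], k2[-2:])  # True 18 19
```

Note the trade-off this sketch inherits from the thread: range queries within one id+day stay cheap (hour rows are adjacent in name only via the suffix), but cross-day scans now require computing each day's hash.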
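Returning to Daniel's opening question about slicing 1/5th of a row: because columns within a row are stored sorted by name, a slice query can bound its result without materializing the rest of the row. A toy in-memory model of that behavior (purely illustrative, not Cassandra code; the function and names are invented):

```python
# Toy model: a row is a list of (name, value) pairs sorted by column
# name, standing in for the sorted column index of one Cassandra row.
import bisect

def slice_columns(row, start, finish, count=100):
    """Return at most `count` columns with start <= name <= finish."""
    names = [name for name, _ in row]
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    # Only the requested contiguous range is touched, not the whole row.
    return row[lo:hi][:count]

row = sorted(("col%03d" % i, i) for i in range(500))
# Ask for 1/5th of a 500-column row.
fifth = slice_columns(row, "col000", "col099", count=100)
print(len(fifth))  # 100
```

The real server does the bounding against its on-disk column index rather than a Python list, but the consequence matches Sylvain's answer: a slice of 100 columns costs roughly the same whether the row holds 500 columns or millions (compaction aside).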