Subject: Re: A few quick questions to help me design a better schema..
From: Tyler Hobbs
To: user@cassandra.apache.org
Date: Sun, 9 Jan 2011 12:54:40 -0600

> 1.) If certain columns in a row are mutated very frequently, or if new
> columns are added to the row frequently, are reads of old columns that
> rarely change also affected? In other words, is the read performance of
> infrequently changing columns affected in any way when other columns in
> the same row are frequently updated or inserted?

Yes, the performance of reading columns that you haven't changed will
still be affected by changing other columns in the row. Constantly
updating a row causes it to be split across multiple SSTables. If you are
asking for the columns by name, you may not need to actually read any
extra data from most of the SSTables, but you will need to at least read
the per-row Bloom filter on each (or read the index and scan a portion of
the row for slices); this costs one seek for each SSTable.

> 2.) Are all columns inside a super column family supercolumns, or can
> they be a mix of simple columns and supercolumns?

They are all super columns. There is no mixing of column types.

> 3.) When the row cache is enabled and certain columns of a row are read,
> is the entire row put into the cache, or only the columns that were read?

The entire row will be put into the cache. This is good motivation for
splitting timelines into multiple rows, each covering a relatively short
timespan, if you mainly read the very end of the timeline. Note that there
has been discussion somewhere of allowing you to cache only the last N
columns of a row in the row cache.

> 4.) Does a larger number of column families have any impact on
> performance (I read about it somewhere)? Should information for a
> particular row key be split across multiple column families according to
> the specific query demands, or should all data related to a particular
> row key be kept together in a single column family?

A higher number of column families requires more memory and causes more
compactions to occur. I can't answer the rest of the question accurately
without more detail on the particular use case.

> 5.) Are there any limitations of valueless columns to consider? I read
> in a ppt "Only works with <= 2B columns in 0.7 valueless colum". I could
> not understand the meaning of this statement.

I believe this is referring to the 2 billion column limit per row. In the
real world, you generally don't want to get anywhere near that many
columns in a single row.

- Tyler
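
To make answer 1 a little more concrete, here is a toy model of a by-name read across several SSTables. It is only a sketch and not Cassandra code: `ToySSTable`, `may_contain`, and `read_columns` are hypothetical names, the exact dictionary lookup stands in for the per-row Bloom filter (a real one can report false positives), and "newest SSTable wins" is a simplifying assumption in place of timestamp resolution.

```python
from dataclasses import dataclass, field


@dataclass
class ToySSTable:
    """Hypothetical stand-in for one SSTable: row key -> {column name: value}."""
    rows: dict = field(default_factory=dict)

    def may_contain(self, row_key, column):
        # Stand-in for the per-row Bloom filter check (assumption: exact,
        # so no false positives, unlike a real Bloom filter).
        return column in self.rows.get(row_key, {})


def read_columns(sstables, row_key, wanted):
    """Return (columns found, SSTables consulted, SSTables read for data)."""
    found, consulted, data_reads = {}, 0, 0
    # Walk oldest to newest so values from newer SSTables overwrite older ones.
    for table in sstables:
        if row_key not in table.rows:
            continue
        consulted += 1  # roughly one seek per SSTable that holds part of the row
        hits = [c for c in wanted if table.may_contain(row_key, c)]
        if hits:
            data_reads += 1
            for c in hits:
                found[c] = table.rows[row_key][c]
    return found, consulted, data_reads


# "bio" is written once; "last_seen" is rewritten before every flush, so the
# row ends up spread over three SSTables.
sstables = [
    ToySSTable({"user1": {"bio": "hello", "last_seen": 1}}),
    ToySSTable({"user1": {"last_seen": 2}}),
    ToySSTable({"user1": {"last_seen": 3}}),
]
found, consulted, data_reads = read_columns(sstables, "user1", ["bio"])
print(found, consulted, data_reads)  # prints: {'bio': 'hello'} 3 1
```

In this model, reading the never-changed `bio` column still consults all three SSTables even though only one of them supplies data, which is the per-SSTable cost the answer describes.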
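
The row-splitting idea from answer 3 can be sketched as a row-key scheme that puts each timeline entry into a fixed-width time bucket. This is a minimal sketch under stated assumptions: the one-day bucket width, the `user_id:bucket_start` key format, and the helper names `timeline_row_key` and `recent_row_keys` are all illustrative choices, not anything prescribed by Cassandra or by the thread.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 24 * 60 * 60  # assumed bucket width: one day


def timeline_row_key(user_id, when, bucket_seconds=BUCKET_SECONDS):
    """Build a row key that places `when` into a fixed-width time bucket."""
    epoch = int(when.timestamp())
    bucket_start = epoch - (epoch % bucket_seconds)
    return f"{user_id}:{bucket_start}"


def recent_row_keys(user_id, now, count, bucket_seconds=BUCKET_SECONDS):
    """Row keys for the newest `count` buckets, newest first."""
    epoch = int(now.timestamp())
    newest = epoch - (epoch % bucket_seconds)
    return [f"{user_id}:{newest - i * bucket_seconds}" for i in range(count)]


now = datetime(2011, 1, 9, 18, 54, 40, tzinfo=timezone.utc)
print(timeline_row_key("alice", now))    # prints: alice:1294531200
print(recent_row_keys("alice", now, 2))  # newest bucket first
```

With keys like these, a reader who mostly wants the end of the timeline only touches the newest one or two rows, so the row cache holds a day of entries per row rather than the whole history. A narrower bucket keeps cached rows smaller at the cost of more rows to fetch for longer ranges.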