From: Roshni Rajagopal
Reply-To: user@cassandra.apache.org
Subject: RE: Cassandra Counters
Date: Tue, 25 Sep 2012 12:06:48 +0530

Thanks for the reply, and sorry for being bull-headed.

Once you're past the stage where you've decided on a distributed NoSQL store, and on Cassandra out of all the NoSQL options, there are still different ways to count something in Cassandra. In all of them you want to use Cassandra's best features: availability, tunable consistency, partition tolerance, etc.

Given this, what are the performance tradeoffs of using counters vs. a standard column family for counting? As I see it, if the number in a counter column family becomes wrong, it will not be 'eventually consistent'; you will need intervention to correct it. So the key question is how much faster a counter column family would be, and at what volumes we start seeing a difference.

Date: Tue, 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.petrov@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be a little overkill.

I assume that "distributed counting" here was more of a map/reduce approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're doing some more complex counting (e.g. based on sets of rules) like that. Of course, that would perform _way_ slower than counting beforehand. On the other hand, you will always have a consistent result for a consistent dataset.

Alternatively, if you use things like AMQP or Storm (sorry to lump them together in one sentence like that, as the tools are mostly either orthogonal or complementary, but I hope you get my point), you could build a topology that makes fault-tolerant writes independently of your original write. Of course, it would still have a consistency tradeoff, mostly because of race conditions, differing network latencies, etc.

So I would say that building a data model in a distributed system often depends more on your problem than on the common patterns, because everything has a tradeoff.

Want to have an immediate result? Modify your counter while writing the row.
Can sacrifice speed, but want more counting opportunities? Go with offline distributed counting.
Want a bit of both? Dispatch a message and react to it, keeping the processing logic and writes decoupled from the main application, which lets you care less about speed.

However, I may have missed the point somewhere (early morning, you know), so I may be wrong in any given statement.
Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal wrote:

Thanks Milind,

Has anyone implemented counting in a standard column family in Cassandra, where you can have both increments and decrements to the count? Are there any performance comparisons with counter column families?

Regards,
Roshni

Date: Mon, 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindparikh@gmail.com
To: user@cassandra.apache.org

IMO

You would use Cassandra counters (or another variation of distributed counting) once you have determined that a centralized version of counting is not going to work.

You'd determine the infeasibility of centralized counting by figuring out the rate at which you need to sustain writes and reads, and reconciling that with your hard disk seek times (essentially).

Once you have "proved" that you can't do centralized counting, the second layer of the arsenal comes into play, which is distributed counting.

In distributed counting, the CAP theorem comes into play, and in Cassandra, availability and partition tolerance trump consistency.

So yes, you sacrifice strong consistency for availability and partition tolerance, and settle for eventual consistency.

On Sep 24, 2012 10:28 AM, "Roshni Rajagopal" wrote:

Hi folks,

I looked at my mail below, and I'm rambling a bit, so I'll try to restate my queries pointwise.

a) What are the performance tradeoffs on reads & writes between creating a standard column family and manually doing the counts by a lookup on a key, versus using counters?

b) What's the current state of counter limitations in the latest version of Apache Cassandra?

c) With there being a possibility of counter values getting out of sync, would counters not be recommended where strong consistency is desired? The normal benefits of Cassandra's tunable consistency would not apply, as retries may cause overcounting. So the normal use case is one where high performance matters and consistency is not paramount.

Regards,
roshni

From: roshni_rajagopal@hotmail.com
To: user@cassandra.apache.org
Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530

Hi,

I'm trying to understand if counters are a good fit for my use case. I've watched http://blip.tv/datastax/counters-in-cassandra-5497678 many times over now... and still need help!

Suppose I have a list of items to which I can add or delete a set of items at a time, and I want a count of the items, without considering changing the database or adding components like ZooKeeper. I have 2 options: the first is a counter column family, and the second is a standard one.

1. List_Counter_CF

          TotalItems
ListId    50

2. List_Std_CF

          TimeUUID1  TimeUUID2  TimeUUID3  TimeUUID4  TimeUUID5
ListId    3          70         -20        3          -6

In the second I can add a new column with every set of items added or deleted. Over time this row may grow wide. To display the final count, I'd need to read the row, slice through all the columns and add them.

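A minimal Python sketch of the read path for the second option, assuming the row has already been fetched into a dict that maps each TimeUUID column name to its signed delta. The dict only stands in for a row slice; `total_items` and the use of `uuid.uuid1()` for column names are illustrative assumptions, not calls into any Cassandra client.

```python
import uuid


def total_items(row):
    """Sum the signed deltas stored in one wide row (column name -> delta)."""
    return sum(row.values())


# Hypothetical contents of the List_Std_CF row for one ListId:
# one column per batch of items added (+) or deleted (-).
deltas = [3, 70, -20, 3, -6]
row = {uuid.uuid1(): delta for delta in deltas}

print(total_items(row))  # 50, the same figure as TotalItems in List_Counter_CF
```

The cost of this read grows with the number of columns in the row, which is the tradeoff the question is asking about.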
In both cases the writes should be fast; in fact the standard column family should be faster, as there's no read before write. And for a CL ONE write the latency should be the same. For reads, the first option is very good: just read one column for a key.

For the second, the read involves reading the row and adding up each column value in application code. I don't think there's a way to do math via CQL yet. There should be no hot-spotting if the key is sharded well. I could even maintain the count derived from List_Std_CF in a separate standard column family holding the final number, and I could do that as a separate process immediately after the write to List_Std_CF completes, so that it's not blocking. I understand Cassandra is faster for writes than reads, but how slow would reading by row key be? Is there any figure for the number of columns after which performance starts deteriorating, or for how much worse it would get?

The advantage I see is that I can use the same consistency rules as for the rest of the column families. With quorum reads & writes, you get strongly consistent values. With counters, I see that on timeout exceptions (because the first replica is down or not responding) there's a chance of the values getting messed up, and retrying can mess them up further. It's not idempotent like a standard column family design can be.

If it gets messed up, it would need an administrator's help (is there a document on how we could resolve counter values going wrong?).

I believe the rest of the limitations still hold good. Has anything changed in recent versions? In my opinion, they are not as major as the consistency question.

- removing a counter & then modifying its value: behaviour is undetermined
- special process for counter column family sstable loss (need to remove all files)
- no TTL support
- no secondary indexes

In short, I can recommend counters for analytics, or for data where the exact numbers are not important, or when it's OK to take some time to fix a mismatch and the performance requirements are most important.

However, where the numbers must match, it's better to use a standard column family and a manual implementation.

Please share your thoughts on this.

Regards,
roshni

--
alex p
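
The retry concern raised in the original message can be illustrated with a small in-memory Python simulation. It does not talk to Cassandra: `CounterColumn` and `DeltaRow` are hypothetical stand-ins, and the scenario assumes the first attempt was in fact applied even though the client saw a timeout.

```python
import uuid


class CounterColumn:
    """Toy stand-in for a counter column: every write is an increment."""

    def __init__(self):
        self.value = 0

    def increment(self, delta):
        self.value += delta


class DeltaRow:
    """Toy stand-in for a List_Std_CF row: every write sets one named column."""

    def __init__(self):
        self.columns = {}

    def write(self, column_name, delta):
        # Writing the same column name again overwrites it with the same value.
        self.columns[column_name] = delta

    def total(self):
        return sum(self.columns.values())


counter = CounterColumn()
row = DeltaRow()

# The client adds 3 items, sees a timeout, and retries the same write.
# Assumption: the first attempt was applied despite the timeout.
write_id = uuid.uuid1()  # generated once by the client and reused on retry
for _attempt in range(2):
    counter.increment(3)
    row.write(write_id, 3)

print(counter.value)  # 6: the retry was counted twice
print(row.total())    # 3: the retry rewrote the same column
```

The standard column family design is only idempotent here because the client reuses the same column name on retry; generating a fresh TimeUUID for each attempt would overcount in the same way as the counter.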