From: Edward Kibardin <infalco@gmail.com>
Date: Tue, 25 Sep 2012 12:22:24 +0100
Subject: Re: Cassandra Counters
To: user@cassandra.apache.org

I've recently noticed several threads about Cassandra Counters inconsistencies
and have started thinking seriously about possible workarounds, like storing
realtime counters in Redis and dumping them to Cassandra daily.
So, the general question: should I rely on Counters if I want 100% accuracy?

Thanks, Ed

On Tue, Sep 25, 2012 at 8:15 AM, Robin Verlangen <robin@us2.nl> wrote:

> From my point of view, another problem with using the "standard column
> family" for counting is transactions. Cassandra lacks them, so if you're
> updating counters from multiple threads, how will you keep track of that?
> Yes, I'm aware of software like ZooKeeper for that, however I'm not sure
> whether that's the best option.
>
> I think you should stick with Cassandra counter column families.
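>
> A rough sketch of the problem I mean, assuming CQL3 tables and the DataStax
> Python driver (the keyspace, table and column names are only illustrative):
>
>   # Assumed schema:
>   #   CREATE TABLE item_count   (list_id text PRIMARY KEY, total int);
>   #   CREATE TABLE item_counter (list_id text PRIMARY KEY, total counter);
>   from cassandra.cluster import Cluster
>
>   session = Cluster(['127.0.0.1']).connect('demo')
>
>   def add_items_standard(list_id, delta):
>       # Read-modify-write: two threads can both read 50 and both write 55,
>       # silently losing one update. There is no transaction to prevent it.
>       row = session.execute("SELECT total FROM item_count WHERE list_id = %s",
>                             (list_id,)).one()
>       current = row.total if row else 0
>       session.execute("UPDATE item_count SET total = %s WHERE list_id = %s",
>                       (current + delta, list_id))
>
>   def add_items_counter(list_id, delta):
>       # Counter column: a single server-side increment, no read beforehand,
>       # safe to call concurrently from many threads or clients.
>       session.execute("UPDATE item_counter SET total = total + %s WHERE list_id = %s",
>                       (delta, list_id))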
>
> Best regards,
>
> Robin Verlangen
> Software engineer
>
> W http://www.robinverlangen.nl
> E robin@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
> 2012/9/25 Roshni Rajagopal <roshni_rajagopal@hotmail.com>
>
>> Thanks for the reply, and sorry for being bull-headed.
>>
>> Once you're past the stage where you've decided it's distributed, NoSQL,
>> and Cassandra out of all the NoSQL options, you can count things in
>> different ways in Cassandra. In all of them you want to use Cassandra's
>> best features: availability, tunable consistency, partition tolerance, etc.
>>
>> Given this, what are the performance tradeoffs of using counters versus a
>> standard column family for counting? As I see it, if the value in a
>> counter column family becomes wrong, it will not be 'eventually
>> consistent' - you will need intervention to correct it. So the key
>> question is how much faster a counter column family would be, and at what
>> numbers we start seeing a difference.
>>
>> ------------------------------
>> Date: Tue, 25 Sep 2012 07:57:08 +0200
>> Subject: Re: Cassandra Counters
>> From: oleksandr.petrov@gmail.com
>> To: user@cassandra.apache.org
>>
>> Maybe I'm missing the point, but counting in a standard column family
>> would be a little overkill.
>>
>> I assume that "distributed counting" here was more of a map/reduce
>> approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a
>> lot. We're doing some more complex counting (e.g. based on sets of rules)
>> like that. Of course, that would perform _way_ slower than counting
>> beforehand. On the other hand, you will always have a consistent result
>> for a consistent dataset.
>>
>> If, instead, you use things like AMQP or Storm (sorry for putting my
>> sentence together like that, as the tools are mostly either orthogonal or
>> complementary, but I hope you get my point), you could build a topology
>> that makes fault-tolerant writes independently of your original write. Of
>> course, it would still have a consistency tradeoff, mostly because of race
>> conditions, different network latencies, etc.
>>
>> So I would say that building a data model in a distributed system often
>> depends more on your problem than on the common patterns, because
>> everything has a tradeoff.
>>
>> Want an immediate result? Modify your counter while writing the row.
>> Can you sacrifice speed but want more counting opportunities? Go with
>> offline distributed counting.
>> Want something in between? Dispatch a message and react upon it, keeping
>> the processing logic and writes decoupled from the main application, which
>> lets you care less about speed (a rough sketch of this below).
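>>
>> A toy sketch of that last option, with Python's standard-library queue
>> standing in for AMQP/Storm (purely illustrative, not how we actually run
>> it):
>>
>>   import queue
>>   import threading
>>   from collections import defaultdict
>>
>>   events = queue.Queue()
>>   totals = defaultdict(int)        # stand-in for the counter column family
>>
>>   def worker():
>>       # A real topology would issue the counter update (or a batch of them)
>>       # here, independently of the original write path.
>>       while True:
>>           list_id, delta = events.get()
>>           totals[list_id] += delta
>>           events.task_done()
>>
>>   threading.Thread(target=worker, daemon=True).start()
>>
>>   def record_items(list_id, delta):
>>       # The application only enqueues; it never blocks on counting.
>>       events.put((list_id, delta))
>>
>>   record_items('list42', 3)
>>   record_items('list42', -1)
>>   events.join()
>>   print(totals['list42'])          # -> 2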
>>
>> However, I may have missed the point somewhere (early morning, you know),
>> so I may be wrong in any given statement.
>> Cheers
>>
>> On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal <
>> roshni_rajagopal@hotmail.com> wrote:
>>
>> Thanks Milind,
>>
>> Has anyone implemented counting in a standard column family in Cassandra,
>> when you can have both increments and decrements to the count?
>> Any comparisons in performance to using counter column families?
>>
>> Regards,
>> Roshni
>>
>> ------------------------------
>> Date: Mon, 24 Sep 2012 11:02:51 -0700
>> Subject: RE: Cassandra Counters
>> From: milindparikh@gmail.com
>> To: user@cassandra.apache.org
>>
>> IMO you would use Cassandra Counters (or another variation of distributed
>> counting) once you have determined that a centralized version of counting
>> is not going to work.
>> You'd determine the non-feasibility of centralized counting by figuring
>> out the speed at which you need to sustain writes and reads, and
>> reconciling that with your hard disk seek times (essentially).
>> Once you have "proved" that you can't do centralized counting, the second
>> layer of arsenal comes into play, which is distributed counting.
>> In distributed counting, the CAP theorem comes into play, and in
>> Cassandra, Availability and Partition tolerance trump Consistency.
>>
>> So yes, you sacrifice strong consistency for availability and partition
>> tolerance, and settle for eventual consistency.
>>
>> On Sep 24, 2012 10:28 AM, "Roshni Rajagopal" <
>> roshni_rajagopal@hotmail.com> wrote:
>>
>> Hi folks,
>>
>> I looked at my mail below, and I'm rambling a bit, so I'll try to
>> re-state my queries pointwise.
>>
>> a) What are the performance tradeoffs on reads and writes between
>> creating a standard column family and manually doing the counts by a
>> lookup on a key, versus using counters?
>>
>> b) What is the current state of counter limitations in the latest
>> version of Apache Cassandra?
>>
>> c) With there being a possibility of counter values getting out of sync,
>> would counters not be recommended where strong consistency is desired?
>> The normal benefits of Cassandra's tunable consistency would not apply,
>> as retries may overstate the count. So the normal use case is high
>> performance, where consistency is not paramount.
>>
>> Regards,
>> Roshni
>>
>> ------------------------------
>> From: roshni_rajagopal@hotmail.com
>> To: user@cassandra.apache.org
>> Subject: Cassandra Counters
>> Date: Mon, 24 Sep 2012 16:21:55 +0530
>>
>> Hi,
>>
>> I'm trying to understand if counters are a good fit for my use case.
>> I've watched http://blip.tv/datastax/counters-in-cassandra-5497678 many
>> times over now... and still need help!
>>
>> Suppose I have a list of items to which I can add or delete a set of
>> items at a time, and I want a count of the items, without changing the
>> database or adding components like ZooKeeper.
>> I have two options: the first is a counter column family, the second a
>> standard one.
>>
>> 1. List_Counter_CF
>>             TotalItems
>>     ListId  50
>>
>> 2. List_Std_CF
>>             TimeUUID1  TimeUUID2  TimeUUID3  TimeUUID4  TimeUUID5
>>     ListId  3          70         -20        3          -6
>>
>> In the second I can add a new column with every set of items added or
>> deleted. Over time this row may grow wide.
>> To display the final count, I'd need to read the row, slice through all
>> the columns and add them up.
>>
>> In both cases the writes should be fast; in fact the standard column
>> family should be faster, as there's no read before write. And for a
>> CL ONE write the latency should be the same.
>> For reads, the first option is very good: just read one column for a key.
>> For the second, the read involves reading the row and adding each column
>> value via application code. I don't think there's a way to do math via
>> CQL yet.
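>>
>> To make the second option concrete, here is a rough sketch of what I
>> mean, written against CQL3-style tables with the DataStax Python driver
>> (keyspace, table and column names are only illustrative):
>>
>>   import uuid
>>   from cassandra.cluster import Cluster
>>
>>   session = Cluster(['127.0.0.1']).connect('demo')
>>   # Assumed schema for List_Std_CF:
>>   #   CREATE TABLE list_std (list_id text, change_id timeuuid, delta int,
>>   #                          PRIMARY KEY (list_id, change_id));
>>
>>   def record_change(list_id, delta, change_id=None):
>>       # Write-only, no read before write. Retrying the same INSERT (same
>>       # change_id) after a timeout just overwrites the same column, so the
>>       # retry is safe.
>>       change_id = change_id or uuid.uuid1()
>>       session.execute(
>>           "INSERT INTO list_std (list_id, change_id, delta) VALUES (%s, %s, %s)",
>>           (list_id, change_id, delta))
>>       return change_id
>>
>>   def total_items(list_id):
>>       # Read the whole row and add up the deltas in application code.
>>       rows = session.execute(
>>           "SELECT delta FROM list_std WHERE list_id = %s", (list_id,))
>>       return sum(r.delta for r in rows)
>>
>> (The first option would just be a single counter increment instead of the
>> INSERT, and a one-column read instead of the sum.)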
>> There should be no hot-spotting if the key is sharded well. I could even
>> maintain the count derived from List_Std_CF in a separate standard column
>> family holding the final number, but I could do that as a separate process
>> immediately after the write to List_Std_CF completes, so that it's not
>> blocking. I understand Cassandra is faster for writes than reads, but how
>> slow would reading by row key be...? Is there any number for how many
>> columns it takes before performance starts deteriorating, or for how much
>> worse the performance would be?
>>
>> The advantage I see is that I can use the same consistency rules as for
>> the rest of the column families: with quorum for reads and writes, you get
>> strongly consistent values.
>> With counters I see that, in case of timeout exceptions because the first
>> replica is down or not responding, there's a chance of the values getting
>> messed up, and retrying can mess them up further. It's not idempotent the
>> way a standard column family design can be.
>> If it gets messed up, it would need an administrator's help (is there a
>> document on how we could fix counter values that have gone wrong?).
>>
>> I believe the rest of the limitations still hold good - has anything
>> changed in recent versions? In my opinion they are not as major as the
>> consistency question:
>> - removing a counter and then modifying the value - behaviour is undefined
>> - special process for counter column family SSTable loss (need to remove
>>   all files)
>> - no TTL support
>> - no secondary indexes
>>
>> In short, I'd say counters can be used for analytics, or when dealing
>> with data where the exact numbers are not important, or when it's OK to
>> take some time to fix a mismatch and the performance requirements matter
>> most.
>> Where the numbers must match, it's better to use a standard column family
>> and a manual implementation.
>>
>> Please share your thoughts on this.
>>
>> Regards,
>> Roshni
>>
>>
>> --
>> alex p