Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of roshni_rajagopal@hotmail.com
 designates 65.55.34.141 as permitted sender)
Message-ID: <COL121-W1390A4F329B575BD19C10BFC9D0@phx.gbl>
Content-Type: multipart/alternative;
	boundary="_c99d5a69-8070-4fe7-98d5-e88a30afd58e_"
From: Roshni Rajagopal <roshni_rajagopal@hotmail.com>
To: <user@cassandra.apache.org>
Subject: RE: Cassandra Counters
Date: Tue, 25 Sep 2012 12:06:48 +0530
Importance: Normal
In-Reply-To: 
 <CAA1qPbapsFdXhpWXQsQWGqbm9yu=SMm_Y7kop8e48VG29AXV9w@mail.gmail.com>
References: 
 <COL121-W579388C87029DF6AB7B412FC9E0@phx.gbl>,<COL121-W36A6F4D989B7C5C759D181FC9E0@phx.gbl>,<CALoo1W0be1KNxi0e80gxXYnwCvBzLJ3FsiTeVGh1hfNTzuu5Vg@mail.gmail.com>,<COL121-W238CFFFB0C1957D3DA976AFC9D0@phx.gbl>,<CAA1qPbapsFdXhpWXQsQWGqbm9yu=SMm_Y7kop8e48VG29AXV9w@mail.gmail.com>
MIME-Version: 1.0

--_c99d5a69-8070-4fe7-98d5-e88a30afd58e_
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


Thanks for the reply, and sorry for being bull-headed.
Once you're past the stage where you've decided on a distributed NoSQL
store, and on Cassandra out of all the NoSQL options, there are
different ways to count something in Cassandra. In all of them you want
to use Cassandra's best features: availability, tunable consistency,
partition tolerance, etc.
Given this, what are the performance tradeoffs of using counters versus
a standard column family for counting? As I see it, if the value in a
counter column family becomes wrong, it will not be 'eventually
consistent'; you will need intervention to correct it. So the key
question is how much faster a counter column family would be, and at
what volumes we start seeing a difference.


Date: Tue=2C 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.petrov@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family
would be a little overkill.
I assume that "distributed counting" here meant more of a map/reduce
approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help
you a lot. We're doing some more complex counting (e.g. based on sets
of rules) like that. Of course, that would perform _way_ slower than
counting beforehand. On the plus side, you will always have a
consistent result for a consistent dataset.

On the other hand, if you use things like AMQP or Storm (sorry to lump
them together in one sentence, as the tools are mostly either
orthogonal or complementary, but I hope you get my point), you could
build a topology that makes fault-tolerant writes independently of your
original write. Of course, it would still have a consistency tradeoff,
mostly because of race conditions, differing network latencies, etc.

So I would say that building a data model in a distributed system often
depends more on your problem than on the common patterns, because
everything has a tradeoff.
Want an immediate result? Modify your counter while writing the row.
Can sacrifice speed, but want more counting opportunities? Go with
offline distributed counting.
Want a bit of both? Dispatch a message and react to it, keeping the
processing logic and writes decoupled from the main application, which
lets you care less about speed.

However, I may have missed the point somewhere (early morning, you
know), so I may be wrong in any given statement.
Cheers

On Tue=2C Sep 25=2C 2012 at 6:53 AM=2C Roshni Rajagopal <roshni_rajagopal@h=
otmail.com> wrote:


Thanks Milind,
Has anyone implemented counting in a standard column family in
Cassandra, where you can have increments and decrements to the count?
Any comparisons in performance to using counter column families?

Regards,
Roshni

Date: Mon=2C 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindparikh@gmail.com

To: user@cassandra.apache.org

IMO

You would use Cassandra counters (or another variation of distributed
counting) once you have determined that a centralized version of
counting is not going to work.

You'd determine the infeasibility of centralized counting by working
out the rate at which you need to sustain writes and reads, and
reconciling that with your hard disk seek times (essentially).

Once you have "proved" that you can't do centralized counting, the
second layer of the arsenal comes into play, which is distributed
counting.

In distributed counting, the CAP theorem comes into play, and in
Cassandra, Availability and Partition tolerance trump Consistency.

So yes, you sacrifice strong consistency for availability and partition
tolerance, and settle for eventual consistency.

On Sep 24=2C 2012 10:28 AM=2C "Roshni Rajagopal" <roshni_rajagopal@hotmail.=
com> wrote:


Hi folks,
I looked at my mail below, and I'm rambling a bit, so I'll try to
re-state my queries pointwise.
a) What are the performance tradeoffs on reads and writes between
creating a standard column family and manually doing the counts by a
lookup on a key, versus using counters?

b) What is the current state of counter limitations in the latest
version of Apache Cassandra?
c) With there being a possibility of counter values getting out of
sync, would counters not be recommended where strong consistency is
desired? The normal benefits of Cassandra's tunable consistency would
not apply, as retries may cause overcounting. So the normal use case is
one where high performance is needed and consistency is not paramount.


Regards,
roshni


From: roshni_rajagopal@hotmail.com
To: user@cassandra.apache.org


Subject: Cassandra Counters
Date: Mon=2C 24 Sep 2012 16:21:55 +0530


Hi,
I'm trying to understand if counters are a good fit for my use case.
I've watched http://blip.tv/datastax/counters-in-cassandra-5497678 many
times over now...

and still need help!
Suppose I have a list of items, to which I can add or delete a set of
items at a time, and I want a count of the items, without considering
changing the database or adding components like ZooKeeper.

I have 2 options: the first is a counter column family, and the second
is a standard one.


1. List_Counter_CF

          TotalItems
  ListId          50

2. List_Std_CF

          TimeUUID1  TimeUUID2  TimeUUID3  TimeUUID4  TimeUUID5
  ListId          3         70        -20          3         -6


In the second, I can add a new column with every set of items added or
deleted. Over time this row may grow wide. To display the final count,
I'd need to read the row, slice through all columns and add them.
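
A minimal sketch of that read path in Python, assuming the row has
already been fetched into a plain dict that maps column names to signed
deltas. The function names and the dict are hypothetical stand-ins, not
a Cassandra client API.

```python
def example_row():
    """Hypothetical row for one ListId, deltas from the table."""
    return {
        "TimeUUID1": 3,
        "TimeUUID2": 70,
        "TimeUUID3": -20,
        "TimeUUID4": 3,
        "TimeUUID5": -6,
    }


def total_items(delta_columns):
    """Add up every signed delta column to get the item count."""
    return sum(delta_columns.values())


print(total_items(example_row()))
```

With the five deltas from the table this adds up to 50, the same value
as TotalItems in the counter layout. The cost is that every read has to
touch all columns in the row.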


In both cases the writes should be fast; in fact the standard column
family should be faster, as there's no read before write. And for a
CL ONE write the latency should be the same. For reads, the first
option is very good: just read one column for a key.


For the second, the read involves reading the row and adding up each
column value in application code. I don't think there's a way to do
math via CQL yet. There should be no hot spotting if the key is sharded
well. I could even maintain the count derived from List_Std_CF in a
separate standard column family holding the final number, and do that
in a separate process immediately after the write to List_Std_CF
completes, so that it's not blocking. I understand Cassandra is faster
for writes than reads, but how slow would reading by row key be? Is
there any figure for the number of columns after which performance
starts deteriorating, or for how much worse the performance would be?


The advantage I see is that I can use the same consistency rules as for
the rest of the column families. With quorum for reads and writes, you
get strongly consistent values. With counters, I see that in case of
timeout exceptions because the first replica is down or not responding,
there's a chance of the values getting messed up, and retrying can mess
them up further. It's not idempotent like a standard column family
design can be.
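
To make the idempotency point concrete, here is a toy model in Python,
not real Cassandra client code. It assumes the first attempt was in fact
applied before the timeout and that the client then retries the same
operation. All function names are hypothetical.

```python
def write_delta(row, column_name, delta):
    """Standard column write: the same column name overwrites itself."""
    return {**row, column_name: delta}


def increment(value, delta):
    """Counter write: every applied increment changes the value."""
    return value + delta


def retry_standard(row, column_name, delta):
    """Apply one delta write twice, as a client retry would."""
    return write_delta(write_delta(row, column_name, delta),
                       column_name, delta)


def retry_counter(value, delta):
    """Apply one increment twice, as a client retry would."""
    return increment(increment(value, delta), delta)


# The retried standard write still sums to 3; the counter reaches 6.
print(sum(retry_standard({}, "TimeUUID1", 3).values()))
print(retry_counter(0, 3))
```

The standard write is only safe to retry if the client reuses the same
column name (the same TimeUUID) on the retry.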


If it gets messed up, it would need an administrator's help (is there a
document on how we could resolve counter values going wrong?)
I believe the rest of the limitations still hold good. Has anything
changed in recent versions? In my opinion, they are not as major as the
consistency question.

- removing a counter and then modifying its value: behaviour is
  undetermined
- special process for counter column family sstable loss (need to
  remove all files)
- no TTL support
- no secondary indexes


In short, I would recommend counters for analytics, or for data where
the exact numbers are not important, or when it's OK to take some time
to fix a mismatch and the performance requirements matter most.

However, where the numbers must match, it's better to use a standard
column family and a manual implementation.
Please share your thoughts on this.


Regards,
roshni


--=20
alex p
 		 	   		  =

--_c99d5a69-8070-4fe7-98d5-e88a30afd58e_--