From: Aaron Morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: A few questions on row caching and read consistency ONE
Date: Fri, 19 Aug 2011 10:17:25 +1200

Those numbers sound achievable. As with any scaling exercise, start with the default config and see how you go; a 5ms response time is certainly reasonable, as are the throughput numbers.

e.g.
If you started with 6 nodes, RF 3, with read repair turned on:

20k ops -> 12k reads and 8k writes
x3 (read repair for reads, RF 3 for writes) -> 36k reads, 24k writes for the whole cluster
per node: 6k reads, 4k writes
per node read latency must be below ~0.00017 secs (1/6,000)
per node write latency must be below 0.00025 secs (1/4,000)

0.25 ms write latency is fine; 0.17 ms read latency may need some attention. But you can scale up from there, and you always want to have enough capacity to handle down nodes etc.

It would be interesting to understand how many of your 2 billion keys are hot, to get a better understanding of the cache needs.

When you are going through the dev cycle take a look at nodetool cfhistograms. Rows that receive writes over a long time period can become fragmented and have a longer read path. Background: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
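For reference, a rough sketch of that back-of-envelope arithmetic in Python, using the figures from this thread (20k ops/sec at a 60/40 read/write split, 6 nodes, RF 3). It assumes reads fan out to every replica (read repair on) and treats the budget as a simple per-node service-time bound, so it is only a first approximation:

# Back-of-envelope capacity estimate for the numbers above.
# Assumes reads fan out to all RF replicas (read repair on) and every
# write is applied on all RF replicas.

def per_node_budget(total_ops, read_fraction, nodes, rf):
    reads = total_ops * read_fraction          # client-facing reads/sec
    writes = total_ops * (1 - read_fraction)   # client-facing writes/sec

    node_reads = reads * rf / nodes            # replica reads each node serves
    node_writes = writes * rf / nodes          # replica writes each node applies

    # To keep up, the average service time per op must stay under 1/throughput.
    return {
        "reads_per_node_per_sec": node_reads,
        "writes_per_node_per_sec": node_writes,
        "read_budget_ms": 1000.0 / node_reads,
        "write_budget_ms": 1000.0 / node_writes,
    }

print(per_node_budget(total_ops=20000, read_fraction=0.6, nodes=6, rf=3))
# -> 6000 reads/node, 4000 writes/node, ~0.17 ms read budget, 0.25 ms write budget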
On 19/08/2011, at 3:27 AM, Edward Capriolo wrote:

> On Thu, Aug 18, 2011 at 10:36 AM, Stephen Henderson <stephen.henderson@cognitivematch.com> wrote:
>
> Thanks Ed/Aaron, that really helped a lot.
>
> Just to clarify on the question of writes (sorry, I worded that badly) - do write operations insert rows into the cache on all nodes in the replica set, or does the cache only get populated on reads?
>
> Aaron - in terms of scale, our ultimate goal is to achieve 99% of reads under 5ms (ideally <1ms) with up to 20,000 operations a second (split 60/40 read/write) and up to 2 billion keys. That's the 12-18 month plan at least; short-term we'll be more like 1,000 ops/sec and 10 million keys, which I think Cassandra could cope with comfortably. We're currently working out what the row size will be, but hoping to be under 2 kB max. Consistency isn't massively important. Our use case is a user-profile store for serving optimised advert content with quite tight restrictions on response time, so we have, say, 10ms to gather as much data about a user as possible before we have to decide which creative to serve. If we can read a profile from the store in this time we can serve a personalised ad with a higher chance of engagement, so low latency is a key requirement.
>
> Edward - thanks for the link to the presentation slides. A bit off-topic, but have you ever had a look at Couchbase (previously "Membase")? It's basically memcached with persistence, fault tolerance and online scaling. It's the main alternative platform we're considering for this project and on paper it sounds perfect, though we have a few concerns about it (mainly the lack of an active community, another NoSQL platform to learn, and general uncertainty over the upcoming 2.0 release). We're hoping to do some stress-test comparisons between the two in the near future and I'll try to post the results if they're not too company-specific.
>
> Thanks again,
> Stephen
>
> From: Edward Capriolo [mailto:edlinuxguru@gmail.com]
> Sent: 18 August 2011 14:14
> To: user@cassandra.apache.org
> Subject: Re: A few questions on row caching and read consistency ONE
>
> On Thu, Aug 18, 2011 at 5:01 AM, Stephen Henderson <stephen.henderson@cognitivematch.com> wrote:
>
> Hi,
>
> We're currently in the planning stage of a new project which needs a low-latency, persistent key/value store with a roughly 60:40 read/write split. We're trying to establish if Cassandra is a good fit for this, and in particular what the hardware requirements would be to have the majority of rows cached in memory (other NoSQL platforms like Couchbase/Membase seem like a more natural fit, but we're already reasonably familiar with Cassandra and would rather stick with what we know if it can work).
>
> If anyone could help answer/clarify the following questions it would be a great help (all assume that row caching is enabled for the column family).
>
> Q. If we're using read consistency ONE, does the read request get sent to all nodes in the replica set with the first to reply returned (i.e. all replica nodes will then have that row in their cache), OR does the request only get sent to a single node in the replica set? If it's the latter, would the same node generally be used for all requests for the same key, or would it always be a random node in the replica set? (i.e. if we have multiple reads for one key in quick succession, would this entail potentially multiple disk lookups until all nodes in the set have been hit?)
>
> Q. Related to the above, if only one node receives the request, would the client (Hector in this case) know which node to send the request to directly, or would there potentially be one extra network hop involved (client -> random node -> node with key)?
>
> Q. Is it possible to do a warm cache load of the most recently accessed keys on node startup, or would we have to do this with a client app?
>
> Q. With write consistency ANY, is it correct that following a write request all nodes in the replica set will end up with that row in their cache, as well as on disk, once they receive the write? i.e. total cache size is (cache_memory_per_node * num_nodes) / num_replicas.
>
> Q. If the cluster only has a single column family, random partitioning and no secondary indexes, is there a good metric for estimating how much heap space we would need to leave aside for everything that isn't the row cache? Would it be proportional to the row-cache size or fairly constant?
>
> Thanks,
> Stephen
>
> Stephen Henderson - Lead Developer (Onsite), Cognitive Match
> stephen.henderson@cognitivematch.com | http://www.cognitivematch.com
>
> I did a small presentation on this topic a while back: http://www.edwardcapriolo.com/roller/edwardcapriolo/resource/memcache.odp
>
> 1.
>
> a) All reads go to all replica nodes, even those at READ.ONE, UNLESS you lower the read_repair_chance for the column family.
>
> b) Reads could hit random nodes rather than the same node unless you configure the dynamic snitch to pin requests to a single node. This is described in cassandra.yaml.
>
> 2. Hector, and no client that I know of, routes requests to the proper nodes based on topology. No information I know of has proven this matters.
>
> 3. Cassandra allows you to save your caches so your node will start up warm (saving a large row cache is hard, a large key cache is easy).
>
> 4. Write.ANY would not change how caching works.
>
> 5. There are some calculations out there based on the size of rows. One of the newer features of Cassandra is that it now automatically resizes the row cache under memory pressure. You still have to feel it out, but you do not have to worry as much about setting it too high anymore.
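For what it's worth, a rough Python sketch of the kind of sizing calculation mentioned in (5), using Stephen's formula from Q4 and the figures from this thread (2 kB rows, 6 nodes, RF 3, 2 billion keys). The 8 GB per-node cache budget is only an assumed example, and the sketch ignores the per-row JVM overhead of the on-heap row cache, so treat the output as an upper bound:

# Row-cache sizing sketch using the formula from Q4:
#   distinct rows cacheable cluster-wide ~= rows_cached_per_node * nodes / rf
# The 8 GB per-node cache budget is an assumed example, not a recommendation,
# and the real per-row memory cost is higher than the raw row size (JVM overhead).

row_size_bytes = 2 * 1024                 # ~2 kB max row size (from the thread)
cache_memory_per_node = 8 * 1024 ** 3     # assumed row-cache budget per node (bytes)
nodes, rf = 6, 3
total_keys = 2 * 10 ** 9                  # ~2 billion keys (from the thread)

rows_cached_per_node = cache_memory_per_node // row_size_bytes
distinct_rows_cached = rows_cached_per_node * nodes // rf   # each row lives on rf nodes

print("rows cached per node:  %d" % rows_cached_per_node)
print("distinct rows cached:  %d" % distinct_rows_cached)
print("fraction of keyspace:  %.2f%%" % (100.0 * distinct_rows_cached / total_keys))
# -> ~4.2M rows per node, ~8.4M distinct rows, ~0.42% of the keys,
#    which is why it matters how many of the 2 billion keys are actually hot.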
> One more note. I know you have mentioned the row cache, which is awesome if you can utilize it correctly and your use case is perfect, but key cache + page cache can serve very fast reads as well.
>
> Thank you,
>
> Edward
>
> Wait, Membase is Couchbase? I thought it was NorthScale? (I cannot keep up.) It seems to have coordinators or masters. http://www.slideshare.net/tim.lossen.de/an-introduction-to-membase
> Any solution where all the read/write traffic travels through a master I do not believe to be scalable. Other solutions that use a master for coordination/election but read or write directly to the nodes are "more" scalable, but more fragile.
>
> Q. Why does every scalable architecture except Cassandra seem to have master nodes? :)
>
> It is not in YCSB, so it is hard to say how fast it is or how well it performs.