From: Aaron Morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Read Latency
Date: Wed, 06 Oct 2010 21:14:49 GMT

That's a lot of questions, I'll try to answer some...

Read/Write latency as reported for a CF is the time taken to perform a local read on that node.

Read/Write latency reported on the o.a.c.service.StorageProxy is the time taken to process a complete request, including local and remote reads when CL is > ONE. It includes the time taken to compare the values read from the different nodes, which may include running the read *again* to request full values from the nodes (rather than digest values).
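
To make that concrete, here is a toy Python sketch of a quorum read along those lines. It is an illustration only, not Cassandra's actual code; the Replica class, the MD5 digest and the timestamp resolution are simplified stand-ins.

    import hashlib
    from collections import namedtuple

    Versioned = namedtuple("Versioned", "value timestamp")

    def digest(v):
        # Stand-in for the hash a replica returns instead of the full value.
        return hashlib.md5(repr(v).encode()).hexdigest()

    class Replica:
        def __init__(self, store):
            self.store = store                     # key -> Versioned
        def read(self, key):
            return self.store[key]                 # full data
        def read_digest(self, key):
            return digest(self.store[key])         # hash only

    def quorum_read(replicas, key):
        # One replica returns full data, the others return digests only.
        value = replicas[0].read(key)
        digests = [r.read_digest(key) for r in replicas[1:]]
        if all(d == digest(value) for d in digests):
            return value
        # Digest mismatch: re-read full values from every replica and keep
        # the newest; this is roughly where Read Repair kicks in.
        candidates = [value] + [r.read(key) for r in replicas[1:]]
        return max(candidates, key=lambda v: v.timestamp)

    # One stale replica forces the mismatch path.
    fresh, stale = Versioned("v2", 2), Versioned("v1", 1)
    nodes = [Replica({"k": stale}), Replica({"k": fresh}), Replica({"k": fresh})]
    print(quorum_read(nodes, "k"))                 # -> Versioned(value='v2', timestamp=2)

The point is simply that an agreeing digest is cheap, while a mismatch costs a second, full read from every replica.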

With logging at DEBUG level, these messages in the log may give you some insight into what's happening (most of this is guesswork on my part, others please correct me if I am wrong). A quick way to tally them is sketched after the list.

- strongread/weakread reading digest / data
  Tells you a read with CL > ONE is starting; digest or data says whether the node will return the full data or a hash.
  Lines with weakread are CL ONE reads.

- quorumResponseHandler:
  Tells you the requested CL of nodes have responded to the coordinator.

- Digest mismatch:
  Tells you the nodes did not agree and the coordinator needs to sort it out. I think this will result in Read Repair and the full data (not digest) being read from each node.
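
If you want to see how often each of those messages shows up, something like the following sketch can tally them from a log file. The patterns are just the fragments named above; the exact wording in your system.log may differ by version, so adjust them as needed.

    import re
    import sys
    from collections import Counter

    # Usage: python count_read_msgs.py /path/to/system.log
    PATTERNS = {
        "weakread":              re.compile(r"weakread"),
        "strongread":            re.compile(r"strongread"),
        "quorumResponseHandler": re.compile(r"quorumResponseHandler"),
        "Digest mismatch":       re.compile(r"Digest mismatch"),
    }

    counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1

    for name, n in counts.most_common():
        print("%-22s %d" % (name, n))

A high Digest mismatch count relative to strongread lines suggests Read Repair is doing a lot of the work on the read path.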

So the read path will be slowed down if a RR is needed. This can be helped by nodetool repair. It can also be slowed if there are a lot of SSTables on disk; nodetool compact is your friend there.

Reads to multiple nodes are done in parallel, not serial. I do not know enough about the network parts to help you out there.
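
As a plain-Python illustration of why that matters (nothing Cassandra-specific here, the sleeps are made-up latencies): if replica reads are issued in parallel, a quorum read should cost roughly the slowest replica asked, not the sum of them.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def replica_read(latency_s):
        # Pretend this is one replica answering after latency_s seconds.
        time.sleep(latency_s)
        return "value"

    latencies = [0.010, 0.012]        # quorum of 2 replicas, ~10 ms each

    # Serial: total cost is the sum of the per-replica latencies.
    start = time.time()
    for l in latencies:
        replica_read(l)
    print("serial quorum:   %.1f ms" % ((time.time() - start) * 1000))

    # Parallel: total cost is roughly the slowest replica asked.
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
        list(pool.map(replica_read, latencies))
    print("parallel quorum: %.1f ms" % ((time.time() - start) * 1000))

So if quorum reads consistently cost about twice a single read, that points at fixed per-request overhead (coordination, Thrift, network round trips) rather than the replica reads being serialized.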

Hope that helps you dig a little deeper.
Aaron


On 07 Oct 2010, at 02:55 AM, Wayne <wav100@gmail.com> wrote:

I have been seeing some strange trends in read latency that I wanted to throw out there to find some explanations. We are running 0.6.5 in a 10 node cluster, rf=3. We find that the read latency reported by cfstats is always about 1/4 of the actual time it takes to get the data back to the Python client. We are not using any higher level clients, and we usually are doing Quorum reads (rf=3). If we ask for 1 copy (vs. 2 for Quorum) it is around 2x the time reported in cfstats. This is true whether we have a 0.8 ms read or a 5000 ms read. It is always around 4x the time for a Quorum read and 2x the time for a single value read. This tells me that much of the time waiting for a read has nothing to do with disk random read latency. This is contrary to what is expected.
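
For reference, a sketch of roughly this kind of measurement against the raw Thrift bindings. The module names and the get() signature are from memory of the 0.6-era generated code, and the host, keyspace, column family and key are placeholders, so treat it as a sketch rather than copy-paste code.

    import time
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra                    # thrift-generated 0.6 bindings (assumed)
    from cassandra.ttypes import ColumnPath, ConsistencyLevel

    socket = TSocket.TSocket("127.0.0.1", 9160)        # placeholder host/port
    transport = TTransport.TBufferedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    path = ColumnPath("Standard1", None, "col1")       # placeholder CF/column

    def avg_read_ms(cl, n=100):
        # Average client-observed latency of n identical reads, in ms,
        # for comparison against the cfstats figure.
        start = time.time()
        for _ in range(n):
            client.get("Keyspace1", "key1", path, cl)  # placeholder keyspace/key
        return (time.time() - start) * 1000.0 / n

    print("CL.ONE    avg: %.2f ms" % avg_read_ms(ConsistencyLevel.ONE))
    print("CL.QUORUM avg: %.2f ms" % avg_read_ms(ConsistencyLevel.QUORUM))
    transport.close()

Comparing those two numbers against the cfstats read latency on the nodes is what produces the 2x/4x ratios described above.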

What is that extra time being used for? Waiting 2 ms for a read value at the client when the value is retrieved in 1 ms leaves 1 ms that is unexplainable. Is the node being requested by the client doing some "work" that equals the time spent by the node actually serving up the data? Is this the Thrift server packaging up the response to the client?

Are reads really more CPU bound? We have lower-end CPUs in our nodes; is that part of the cause?

What is cfstats actually reporting? Is it not really reporting ALL of the time required to service a read? I assume it is not reporting the time to send the result to the requesting node.

How much of this time is network time? Would InfiniBand or a lower latency network architecture reduce any of these times? If we want to reduce a 2 ms read to a 1 ms read, what will help us get there? We have cached keys, which gives us a cfstats read latency < 1 ms (~0.85 ms), but it still takes 2 ms to get to the client (single read).

Why does a quorum read double everything? It seems quorum reads are serialized and not parallel. Is that true, and if so, why? Obviously it takes more time to get 2 values and compare than to get one value, but if that is always 2x+ then the adjustable consistency of Cassandra comes at a very high price.

Any other suggestions for decreasing read latency? Faster disks don't seem as useful as faster CPUs. We have worked hard to reduce the cfstats-reported read latency and have been successful. How can we reduce the time from there to the requesting client? What is the anatomy of a read from client request to result? Where does the time usually go, and what can help speed each step up? Caching is the obvious answer, but assume we are already caching what can be cached (keys).

Thanks in advance for any advice or explanations anyone might have.

