From: Aaron Morton <aaron@thelastpickle.com>
To: user@cassandra.apache.org
Subject: Re: Read Latency
Date: Wed, 06 Oct 2010 21:14:49 GMT

That's a lot of questions, I'll try to answer some...

Read/Write latency as reported for a CF is the time taken to perform a local read on that node.

Read/Write latency reported on the o.a.c.service.StorageProxy is the time taken to process a complete request, including local and remote reads when CL is > ONE. It includes the time taken to compare the values read from the different nodes, which may include running the read *again* to request full values from the nodes (rather than digest values).
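
To make that concrete, here is a toy Python sketch of a quorum read along those lines. It is an illustration only, not Cassandra's actual code; the Replica class, the MD5 digest and the timestamp resolution are simplified stand-ins.

    import hashlib
    from collections import namedtuple

    Versioned = namedtuple("Versioned", "value timestamp")

    def digest(v):
        # Stand-in for the hash a replica returns instead of the full value.
        return hashlib.md5(repr(v).encode()).hexdigest()

    class Replica:
        def __init__(self, store):
            self.store = store                     # key -> Versioned
        def read(self, key):
            return self.store[key]                 # full data
        def read_digest(self, key):
            return digest(self.store[key])         # hash only

    def quorum_read(replicas, key):
        # One replica returns full data, the others return digests only.
        value = replicas[0].read(key)
        digests = [r.read_digest(key) for r in replicas[1:]]
        if all(d == digest(value) for d in digests):
            return value
        # Digest mismatch: re-read full values from every replica and keep
        # the newest; this is roughly where Read Repair kicks in.
        candidates = [value] + [r.read(key) for r in replicas[1:]]
        return max(candidates, key=lambda v: v.timestamp)

    # One stale replica forces the mismatch path.
    fresh, stale = Versioned("v2", 2), Versioned("v1", 1)
    nodes = [Replica({"k": stale}), Replica({"k": fresh}), Replica({"k": fresh})]
    print(quorum_read(nodes, "k"))                 # -> Versioned(value='v2', timestamp=2)

The point is simply that an agreeing digest is cheap, while a mismatch costs a second, full read from every replica.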

With logging at DEBUG level, these messages in the log may give you some insight into what's happening (most of this is guesswork on my part, others please correct me if I am wrong). A quick way to tally them is sketched after the list.

- strongread/weakread reading digest / data
  Tells you a read with CL > ONE is starting; digest or data says whether the node will return the full data or a hash.
  Lines with weakread are CL ONE reads.

- quorumResponseHandler:
  Tells you the requested CL of nodes have responded to the coordinator.

- Digest mismatch:
  Tells you the nodes did not agree and the coordinator needs to sort it out. I think this will result in Read Repair and the full data (not digest) being read from each node.
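
If you want to see how often each of those messages shows up, something like the following sketch can tally them from a log file. The patterns are just the fragments named above; the exact wording in your system.log may differ by version, so adjust them as needed.

    import re
    import sys
    from collections import Counter

    # Usage: python count_read_msgs.py /path/to/system.log
    PATTERNS = {
        "weakread":              re.compile(r"weakread"),
        "strongread":            re.compile(r"strongread"),
        "quorumResponseHandler": re.compile(r"quorumResponseHandler"),
        "Digest mismatch":       re.compile(r"Digest mismatch"),
    }

    counts = Counter()
    with open(sys.argv[1]) as log:
        for line in log:
            for name, pattern in PATTERNS.items():
                if pattern.search(line):
                    counts[name] += 1

    for name, n in counts.most_common():
        print("%-22s %d" % (name, n))

A high Digest mismatch count relative to strongread lines suggests Read Repair is doing a lot of the work on the read path.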

So the read path will be slowed down if a RR is needed. This can be helped by nodetool repair. It can also be slowed if there are a lot of SSTables on disk; nodetool compact is your friend there.

Reads to multiple nodes are done in parallel, not serial. I do not know enough about the network parts to help you out there.
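
As a plain-Python illustration of why that matters (nothing Cassandra-specific here, the sleeps are made-up latencies): if replica reads are issued in parallel, a quorum read should cost roughly the slowest replica asked, not the sum of them.

    import time
    from concurrent.futures import ThreadPoolExecutor

    def replica_read(latency_s):
        # Pretend this is one replica answering after latency_s seconds.
        time.sleep(latency_s)
        return "value"

    latencies = [0.010, 0.012]        # quorum of 2 replicas, ~10 ms each

    # Serial: total cost is the sum of the per-replica latencies.
    start = time.time()
    for l in latencies:
        replica_read(l)
    print("serial quorum:   %.1f ms" % ((time.time() - start) * 1000))

    # Parallel: total cost is roughly the slowest replica asked.
    start = time.time()
    with ThreadPoolExecutor(max_workers=len(latencies)) as pool:
        list(pool.map(replica_read, latencies))
    print("parallel quorum: %.1f ms" % ((time.time() - start) * 1000))

So if quorum reads consistently cost about twice a single read, that points at fixed per-request overhead (coordination, Thrift, network round trips) rather than the replica reads being serialized.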

Hope that helps you dig a little deeper.
Aaron


On 07 Oct 2010, at 02:55 AM, Wayne <wav100@gmail.com> wrote:

I have been seeing some strange trends in read latency that I wanted to throw out there to find some explanations. We are running 0.6.5 in a 10 node cluster, rf=3. We find that the read latency reported by cfstats is always about 1/4 of the actual time it takes to get the data back to the Python client. We are not using any higher level clients, and we usually are doing Quorum reads (rf=3). If we ask for 1 copy (vs. 2 for Quorum) it is around 2x the time reported in cfstats. This is true whether we have a 0.8 ms read or a 5000 ms read. It is always around 4x the time for a Quorum read and 2x the time for a single value read. This tells me that much of the time waiting for a read has nothing to do with disk random read latency. This is contrary to what is expected.
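
For reference, a sketch of roughly this kind of measurement against the raw Thrift bindings. The module names and the get() signature are from memory of the 0.6-era generated code, and the host, keyspace, column family and key are placeholders, so treat it as a sketch rather than copy-paste code.

    import time
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra                    # thrift-generated 0.6 bindings (assumed)
    from cassandra.ttypes import ColumnPath, ConsistencyLevel

    socket = TSocket.TSocket("127.0.0.1", 9160)        # placeholder host/port
    transport = TTransport.TBufferedTransport(socket)
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    path = ColumnPath("Standard1", None, "col1")       # placeholder CF/column

    def avg_read_ms(cl, n=100):
        # Average client-observed latency of n identical reads, in ms,
        # for comparison against the cfstats figure.
        start = time.time()
        for _ in range(n):
            client.get("Keyspace1", "key1", path, cl)  # placeholder keyspace/key
        return (time.time() - start) * 1000.0 / n

    print("CL.ONE    avg: %.2f ms" % avg_read_ms(ConsistencyLevel.ONE))
    print("CL.QUORUM avg: %.2f ms" % avg_read_ms(ConsistencyLevel.QUORUM))
    transport.close()

Comparing those two numbers against the cfstats read latency on the nodes is what produces the 2x/4x ratios described above.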

What is that extra time being used for? Waiting 2 ms for a read value at the client when the value is retrieved in 1 ms leaves 1 ms that is unexplainable. Is the node being requested by the client doing some "work" that equals the time spent by the node actually serving up the data? Is this the Thrift server packaging up the response to the client?

Are reads really more CPU bound? We have lower-end CPUs in our nodes; is that part of the cause?

What is cfstats actually reporting? Is it not really reporting ALL of the time required to service a read? I assume it is not reporting the time to send the result to the requesting node.

How much of this time is network time? Would InfiniBand or a lower latency network architecture reduce any of these times? If we want to reduce a 2 ms read to a 1 ms read, what will help us get there? We have cached keys, which gives us a cfstats read latency < 1 ms (~0.85 ms), but it still takes 2 ms to get to the client (single read).

Why does a quorum read double everything? It seems quorum reads are serialized and not parallel. Is that true, and if so, why? Obviously it takes more time to get 2 values and compare than to get one value, but if that is always 2x+ then the adjustable consistency of Cassandra comes at a very high price.

Any other suggestions for decreasing read latency? Faster disks don't seem as useful as faster CPUs. We have worked hard to reduce the cfstats-reported read latency and have been successful. How can we reduce the time from there to the requesting client? What is the anatomy of a read from client request to result? Where does the time usually go, and what can help speed each step up? Caching is the obvious answer, but assume we are already caching what can be cached (keys).

Thanks in advance for any advice or explanations anyone might have.

