From: Andras Szerdahelyi
To: user@cassandra.apache.org
Subject: Read latency skyrockets after 95-98th percentile
Date: Tue, 31 Jul 2012 12:52:49 +0000

hey list,

I've been trying to understand why we are seeing rather nasty read latency peaks in as much as 2% of our total read requests ( not sure what the underlying thrift call is; it should be get_slice ), peaks that we have been **unable to tie to garbage collection or blocking I/O.**
This is what I mean by "nasty":
Mean: 1.186341, Max: 41.912, StDev: 2.24905, 75th: 1.02425, 95th: 2.11035, 98th: 5.21026, 99th: 13.24835, 99.9th: 41.89454 (all in ms)

This is measured around the Astyanax / Hector wrappers for running column slice queries and iterating over their results. FWIW they perform very similarly.
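( For the curious: the percentile figures come from a plain nearest-rank computation over the raw per-read timings, roughly like this stdlib-only sketch. LatencyStats and the sample data are mine for illustration, not from our codebase: )

```java
import java.util.Arrays;

// Nearest-rank percentiles over raw per-read timings - a sketch of how the
// summary quoted above is produced. Sample data below is made up.
public class LatencyStats {

    // nearest-rank percentile: smallest sample such that p percent of
    // the samples are at or below it
    static double percentile(double[] samples, double p) {
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, rank)];
    }

    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    public static void main(String[] args) {
        // made-up latencies in ms: mostly ~1ms plus a handful of outliers,
        // mimicking the distribution shape reported above
        double[] ms = new double[1000];
        Arrays.fill(ms, 1.0);
        ms[0] = 13.2; ms[1] = 20.0; ms[2] = 41.9;
        System.out.printf("mean=%.3f p95=%.3f p99=%.3f p99.9=%.3f%n",
                mean(ms), percentile(ms, 95), percentile(ms, 99), percentile(ms, 99.9));
    }
}
```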

Now, the mean is awesome - 1.2ms for a read is really, really fast; I think thrift overhead alone accounts for probably 1ms of that. What I can't live with is pretty much everything that's beyond the 98th percentile. Perhaps my expectations are misaligned to begin with and this is actually great, but the dataset under test that yielded this result ( see below ) is extremely small - only 1000 items per test, made up of a few small dynamic composite columns - and the reads are done sequentially, from a single-threaded client. I just don't understand what's causing the peaks.

Here's the complete run-down on the setup:
The data is dynamic composite columns with string or integer values. The DynamicComposites themselves are made up of 3 Components each: <String,String,Integer> and <String,Integer,Integer>

The query is getting all columns ( in this particular test, that will be a single column ) of one of the composite types ( <String,String,Integer> or <String,Integer,Integer> ) for a particular string key.
( How this is achieved is more or less at https://gist.github.com/3207894 )
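( In case it helps to picture why that slice works: composite column names sort component by component, so all columns of one shape are contiguous under a key. Illustration only - this is NOT Hector's DynamicComposite, just a stdlib sketch of the ordering: )

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of component-wise ordering of composite column names. Columns that
// share a leading component sort next to each other, which is what makes
// "give me all columns of one composite shape" expressible as a slice range.
public class CompositeSketch {

    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compare(Object[] a, Object[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = ((Comparable) a[i]).compareTo(b[i]); // compare component i
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length); // shorter prefix sorts first
    }

    public static void main(String[] args) {
        List<Object[]> cols = new ArrayList<>();
        cols.add(new Object[]{"type-a", "foo", 2});
        cols.add(new Object[]{"type-a", "foo", 1});
        cols.add(new Object[]{"type-b", 7, 1});
        cols.sort(CompositeSketch::compare);
        // after sorting, the "type-a" columns are adjacent, so a slice from
        // ("type-a",) up to its end-of-range captures exactly that group
        System.out.println(java.util.Arrays.toString(cols.get(0))); // -> [type-a, foo, 1]
    }
}
```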


The production cluster is set up as follows:
30-40 reads/sec from client applications on multiple hosts, NetworkTopologyStrategy, 2 replicas per DC, read consistency ONE, 2 DCs, 2 nodes per DC
8GB multi-core physical hosts
most settings on default in cassandra.yaml
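( For reference, the read-path knobs we did not touch are the stock 1.1-era defaults - the values below are quoted from memory, not copied from our gist, so treat them as assumptions: )

```yaml
# cassandra.yaml excerpt - assumed 1.1 defaults, the settings most
# relevant to read latency on this cluster
concurrent_reads: 32
concurrent_writes: 32
rpc_server_type: sync
```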


In an attempt to isolate the problem, I've done the following ( this is where the results above come from, but we see roughly the same numbers in prod ):

- created a short unit test around the ( two ) classes that wrap the Astyanax and Hector libraries in our application, in which I'm only testing the read performance
- in setUp() I start Cassandra with

// start an embedded, in-process Cassandra daemon for the test
CassandraStoreKeeperTest.cassandra = new CassandraDaemon();
CassandraStoreKeeperTest.cassandra.init(null);
CassandraStoreKeeperTest.cassandra.start();

- same cassandra.yaml as in prod, file at https://gist.github.com/3207894 ( mostly defaults )
- I pass the following flags to the JVM running my unit test: -server -Xmx2G and all the GC flags from cassandra-env.sh, plus the jamm java agent
- I warm up the embedded Cassandra with 1k reads and 1k writes
- the test is 1000 single-threaded, sequential reads via Hector and Astyanax ( no significant differences between the two ) over thrift to the embedded single-node Cassandra instance
- I read back the 1000 items saved in the warm-up
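( The measurement loop itself is nothing fancy - a minimal sketch of its shape; ReadBench and readOp are stand-ins of mine, the real readOp being the Hector/Astyanax slice query: )

```java
// Warm up first, then time N sequential reads - the shape of the loop that
// produced the numbers above. readOp is a placeholder; here it is a no-op
// so the sketch runs standalone.
public class ReadBench {

    static double[] run(Runnable readOp, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) readOp.run();       // warm-up reads
        double[] ms = new double[measured];
        for (int i = 0; i < measured; i++) {
            long t0 = System.nanoTime();
            readOp.run();                                     // the timed read
            ms[i] = (System.nanoTime() - t0) / 1_000_000.0;   // ns -> ms
        }
        return ms;
    }

    public static void main(String[] args) {
        // 1000 warm-up reads, then 1000 timed reads, single-threaded
        double[] ms = run(() -> { /* slice query would go here */ }, 1000, 1000);
        System.out.println("samples collected: " + ms.length);
    }
}
```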

The results are:
Mean: 1.186341, Max: 41.912, StDev: 2.24905, 75th: 1.02425, 95th: 2.11035, 98th: 5.21026, 99th: 13.24835, 99.9th: 41.89454 (all in ms)

Here's the GC output during the read:

[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     281480 bytes,     281480 total
: 17322K->308K(19136K), 0.0037375 secs] 28742K->11885K(83008K), 0.0038170 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     261360 bytes,     261360 total
: 17332K->386K(19136K), 0.0034069 secs] 28909K->12098K(83008K), 0.0034849 secs] [Times: user=0.00 sys=0.01, real=0.00 secs]
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     241336 bytes,     241336 total
: 17410K->383K(19136K), 0.0035252 secs] 29122K->12209K(83008K), 0.0036062 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
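( Each of those ParNew pauses is roughly 3-4ms of wall time, well below the 41ms worst-case read. To sanity-check a whole log rather than eyeball it, I pull out the real= field - a sketch of mine, assuming the -XX:+PrintGCDetails line format shown above: )

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extract the wall-clock pause from a GC log line so every pause in a full
// log can be compared against the observed read-latency outliers.
public class GcPause {

    static final Pattern REAL = Pattern.compile("real=([0-9.]+) secs");

    // returns the real (wall-clock) pause in seconds, or -1 if absent
    static double realSecs(String gcLine) {
        Matcher m = REAL.matcher(gcLine);
        return m.find() ? Double.parseDouble(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        String line = ": 17322K->308K(19136K), 0.0037375 secs] "
                + "28742K->11885K(83008K), 0.0038170 secs] "
                + "[Times: user=0.01 sys=0.00, real=0.01 secs]";
        System.out.println(GcPause.realSecs(line)); // prints 0.01
    }
}
```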

Doesn't look like GC ( I could be wrong ) - what else could it be?

All suggestions are highly appreciated.
Thanks!
Andras


PS: this is with Cassandra 1.1.0



