Victor, the data is a single composite column <String, String, Integer> with a string value. The key is a UTF8 string
I can reproduce this exact behavior with only 100 of these items on a 2GB heap.

Also, please see -XX:+PrintGCDetails output below, i'm not convinced it's GC. With a 100 items no garbage collection is logged at all.

here's the GC output during the read
 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     281480 bytes,     281480 total
: 17322K->308K(19136K), 0.0037375 secs] 28742K->11885K(83008K), 0.0038170 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     261360 bytes,     261360 total
: 17332K->386K(19136K), 0.0034069 secs] 28909K->12098K(83008K), 0.0034849 secs] [Times: user=0.00 sys=0.01, real=0.00 secs] 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     241336 bytes,     241336 total
: 17410K->383K(19136K), 0.0035252 secs] 29122K->12209K(83008K), 0.0036062 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 



thanks,
Andras


On 31 Jul 2012, at 16:04, Viktor Jevdokimov wrote:

What is a data load? Does it fits in RAM?
 
I bet itís due to GC. Since this is a 1 node only, dynamic snitch is out of scope.
 
 


Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer

Phone: +370 5 212 3063, Fax +370 5 261 0453
J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
Follow us on Twitter: @adforminsider
What is Adform: watch this short video

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.

From: Andras Szerdahelyi [mailto:andras.szerdahelyi@ignitionone.com] 
Sent: Tuesday, July 31, 2012 15:53
To: user@cassandra.apache.org
Subject: Read latency skyrockets after 95-98th percentile
 
hey list,
 
i've been trying to understand why we are seeing rather nasty read latency peaks in as much as 2% of our total read ( not sure what the underlying thrift call is, should be get_slice ) requests that we have been **unable to tie to garbage collection or blocking I/O.** 
This is what i mean by "nasty" : 
Mean: 1.186341, Max: 41.912 stDev: 2.2490502737495333, 75th:1.0242499999999999, 95th: 2.1103499999999977, 98th: 5.2102600000000034, 99th: 13.24835000000001, 999th: 41.89454000000041
 
This is measured around the Astyanax / Hector wrappers for running column slice queries and iterating over their results. FWIW they perform very similarly.
 
Now, the mean is awesome - 1.2ms for a read is really really fast, i think thrift overhead makes up for probably 1ms alone. What i can't live with is pretty much everything thats beyond the 98th percentile. Perhaps my expectations are misaligned to begin with and this is great, but the dataset under test that yielded this result ( see below ) is extremely small - only a 1000 items per test made up of a few small dynamic composite columns  and the read is being done sequentially, from a single threaded client. I just don't understand whats causing the peaks.
 
Here's the complete run-down on the setup:
The data is dynamic composite columns with string or integer values. The DynamicComposites themselves are made up of 3 Components each: <String,String,Integer> and <String,Integer,Integer>
 
The query is getting all columns ( in this particular test, that will be a single column ) of  one of the composite types ( <String, String, Integer> or <String, Integer, Integer> ) for a particular string key.
( How this is achieved is more or less at https://gist.github.com/3207894 )
 
 
The production cluster is set up as follows:
30-40 read/sec clients from multiple hosts, NetworkTopologyStrategy, 2 replicas per DC, read consistency: ONE, 2DCs, 2 nodes per DC )
8GB multi-core physical hosts
most settings on default in cassandra.yaml
 
 
In an attempt to isolate the problem, i've done the following ( this is where the results above are coming from but we see roughly the same numbers in prod )
 
- created a short unit test around the ( two ) classes that wrap the Astyanax and Hector libraries in our application, in which i'm only testing the read performance
- in setUp() i start Cassandra with 
 
CassandraStoreKeeperTest.cassandra = new CassandraDaemon();
CassandraStoreKeeperTest.cassandra.init(null);
CassandraStoreKeeperTest.cassandra.start();
 
- same cassandra yaml as in prod, file at https://gist.github.com/3207894 ( mostly defaults )
- i pass the following flags to the JVM running my unit test:  -server -Xmx2G and all the GC flags from cassandra-env.sh, plus the jamm java agent
- i warm up the embedded cassandra with 1k reads and 1k writes
- the test is 1000 single threaded, sequential reads via Hector and Astyanax ( no significant differences between the two ) via thrift to the embedded single-node Cassandra instance
- i read back the 1000 items saved in the warm-up
 
The results are
Mean: 1.186341, Max: 41.912 stDev: 2.2490502737495333, 75th:1.0242499999999999, 95th: 2.1103499999999977, 98th: 5.2102600000000034, 99th: 13.24835000000001, 999th: 41.89454000000041
 
here's the GC output during the read
 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     281480 bytes,     281480 total
: 17322K->308K(19136K), 0.0037375 secs] 28742K->11885K(83008K), 0.0038170 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     261360 bytes,     261360 total
: 17332K->386K(19136K), 0.0034069 secs] 28909K->12098K(83008K), 0.0034849 secs] [Times: user=0.00 sys=0.01, real=0.00 secs] 
[GC [ParNew
Desired survivor size 1081344 bytes, new threshold 1 (max 1)
- age   1:     241336 bytes,     241336 total
: 17410K->383K(19136K), 0.0035252 secs] 29122K->12209K(83008K), 0.0036062 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
 
Doesn't look like GC ( i could be wrong ) - what else could it be ?
 
all suggestions are highly appreciated..
Thanks!
Andras
 
 
ps: this is with Cassandra 1.1.0