Kanwar Sangha
Subject RE: High disk I/O during reads
Date Fri, 22 Mar 2013 22:05:07 GMT
Just as a test - can you disable/reduce compaction throughput and see if that makes a difference?
Compaction eats a lot of I/O.

Jon Scarborough
Sent: 22 March 2013 15:01
To:; Wei Zhu
Subject: Re: High disk I/O during reads

Checked tpstats, there are very few dropped messages.

Checked histograms. Mostly nothing surprising. The vast majority of rows are small, and most
reads only access one or two SSTables.

What I did discover is that of our 5 nodes, one is performing well, with disk I/O in a ballpark
that seems reasonable. The other 4 nodes are doing roughly 4x the disk I/O per second. Interestingly,
the node that is performing well also seems to be servicing about twice as many reads
as the other nodes.

I compared the configuration of the node performing well with those that aren't, and so far
haven't found any discrepancies.

On Fri, Mar 22, 2013 at 10:43 AM, Wei Zhu wrote:
According to your cfstats, read latency is over 100 ms, which is really really slow. I am seeing
less than 3 ms reads for my cluster, which is on SSD. Can you also check nodetool cfhistograms?
It tells you more about the number of SSTables involved and the read/write latency. Sometimes the
average doesn't tell you the whole story.
Also check your nodetool tpstats: are there a lot of dropped reads?
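
To illustrate why an average latency can mislead, here is a small Python sketch. The latency values are made up for illustration and are not taken from this cluster: most reads are fast, and a handful of very slow ones drag the mean up to roughly what cfstats reported.

```python
import statistics

# Synthetic, illustrative latencies in milliseconds (an assumption, not measured data):
# 95 fast reads and 5 very slow ones.
latencies_ms = [3] * 95 + [2300] * 5

mean_ms = statistics.mean(latencies_ms)      # pulled up by the few slow reads
median_ms = statistics.median(latencies_ms)  # what a typical read experiences

print(f"mean   = {mean_ms:.2f} ms")
print(f"median = {median_ms} ms")
```

With a distribution like this the mean lands above 100 ms even though the typical read takes 3 ms, which is the kind of difference the per-bucket histogram output makes visible.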


Jon Scarborough
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our case. Most reads would
probably only need to look at a few SSTables; a few might need to look at more.

I don't yet have a deep understanding of C* internals, but I would imagine even the more expensive
use cases would involve something like this:

1) Check the index for each SSTable to determine if part of the row is there.
2) Look at the endpoints of the slice to determine if the data in a particular SSTable is
relevant to the query.
3) Read the chunks of those SSTables, working backwards from the end of the slice until enough
columns have been read to satisfy the limit clause in the query.

So I would have guessed that even the more expensive queries on wide rows typically wouldn't
need to read more than a few hundred KB from disk to do all that. Seems like I'm missing something.
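
For reference, the three steps above can be modeled in a few lines of Python. This is a toy sketch, not Cassandra's actual read path: SSTables are plain dicts, `reversed_slice` and the sample data are hypothetical, and bloom filters, compression chunks, tombstones, and reconciliation of a column that appears in several SSTables are all left out.

```python
import heapq
from bisect import bisect_left
from itertools import islice

def reversed_slice(sstables, row_key, start, end, limit):
    """Return (columns, sstables_read) for start <= column name < end, newest first.

    Each SSTable is modeled as {row_key: [(column_name, value), ...]} with the
    column list sorted ascending by name.
    """
    candidates = []
    sstables_read = 0
    for sstable in sstables:
        columns = sstable.get(row_key)      # step 1: is any part of the row here?
        if not columns:
            continue
        names = [name for name, _ in columns]
        lo = bisect_left(names, start)      # first column with name >= start
        hi = bisect_left(names, end)        # first column with name >= end
        if lo >= hi:                        # step 2: slice does not overlap this SSTable
            continue
        sstables_read += 1
        # step 3: work backwards from the end of the slice, at most `limit` columns
        candidates.append(columns[max(lo, hi - limit):hi][::-1])
    # Merge the per-SSTable runs (each already newest first) and apply the limit.
    merged = heapq.merge(*candidates, key=lambda col: col[0], reverse=True)
    return list(islice(merged, limit)), sstables_read

# Hypothetical data: integers stand in for the composite column names.
sstables = [
    {"conv1": [(1, "a"), (5, "e"), (9, "i")]},
    {"conv1": [(3, "c"), (7, "g")], "conv2": [(2, "x")]},
    {"conv2": [(4, "y")]},
    {"conv1": [(20, "z")]},
]
result, sstables_read = reversed_slice(sstables, "conv1", start=3, end=10, limit=3)
print(result, sstables_read)
```

In this model the cost of a query grows with the number of SSTables that hold a relevant part of the row, since each one contributes up to `limit` columns before the merge.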

Here's the complete CF definition, including compression settings:

CREATE COLUMNFAMILY conversation_text_message (
  conversation_key bigint PRIMARY KEY
) WITH
  comment='' AND
  read_repair_chance=0.100000 AND
  gc_grace_seconds=864000 AND
  default_validation=text AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write=True AND
  compaction_strategy_class='SizeTieredCompactionStrategy' AND

Much thanks for any additional ideas.


On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean wrote:

Did you mean to ask "are 'all' your keys spread across all SSTables"? I am guessing at your

I mean, I would very much hope my keys are spread across all SSTables; otherwise that SSTable
should not be there, as it has no keys in it ;).

And I know we had HUGE disk usage from the duplication in our SSTables under size-tiered compaction.
We never ran a major compaction, but after we switched to LCS we went from 300G to some 120G
or something like that, which was nice. We only have 300 data point posts / second, so not an
extreme write load on 6 nodes, though each of these posts causes reads to check authorization
and such in our system.


Kanwar Sangha


Date: Friday, March 22, 2013 8:38 AM
Subject: RE: High disk I/O during reads

Are your keys spread across all SSTables? That will cause every SSTable to be read, which will
increase the I/O.

What compaction strategy are you using?


Jon Scarborough

Sent: 21 March 2013 23:00

Subject: High disk I/O during reads


We've had a 5-node C* cluster (version 1.1.0) running for several months. Up until now we've
mostly been writing data, but now we're starting to service more read traffic. We're seeing
far more disk I/O to service these reads than I would have anticipated.

The CF being queried consists of chat messages. Each row represents a conversation between
two people. Each column represents a message. The column key is composite, consisting of the
message date and a few other bits of information. The CF is using compression.

The query is looking for a maximum of 50 messages between two dates, in reverse order. Usually
the two dates used as endpoints are 30 days ago and the current time. The query in Astyanax
looks like this:

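ColumnList<ConversationTextMessageKey> result = keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
    .getKey(conversationKey) // assumption: the row key call and its variable name are not in the original
    .withColumnRange( // assumption: call name and last two arguments inferred from the description above
        textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
        textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
        true, 50).execute().getResult(); // reversed = true, limit = 50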

We're currently servicing around 30 of these queries per second.

Here's what the cfstats for the CF look like:

Column Family: conversation_text_message
SSTable count: 15
Space used (live): 211762982685
Space used (total): 211762982685
Number of Keys (estimate): 330118528
Memtable Columns Count: 68063
Memtable Data Size: 53093938
Memtable Switch Count: 9743
Read Count: 4313344
Read Latency: 118.831 ms.
Write Count: 817876950
Write Latency: 0.023 ms.
Pending Tasks: 0
Bloom Filter False Postives: 6055
Bloom Filter False Ratio: 0.00260
Bloom Filter Space Used: 686266048
Compacted row minimum size: 87
Compacted row maximum size: 14530764
Compacted row mean size: 1186

On the C* nodes, iostat output like this is typical, and can spike to be much worse:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.91    0.00    2.08   30.66    0.50   64.84

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
xvdap1            0.13         0.00         1.07          0         16
xvdb            474.20     13524.53        25.33     202868        380
xvdc            469.87     13455.73        30.40     201836        456
md0             972.13     26980.27        55.73     404704        836

Any thoughts on what could be causing read I/O to the disk from these queries?
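
As a rough sanity check, the iostat figures for md0 can be divided by the query rate to get a per-query cost. The sketch below assumes that all read I/O on md0 comes from these ~30 queries per second, which may not hold (compaction or other reads would inflate the numbers), and that tps is dominated by reads, since writes are small in this sample.

```python
# Figures copied from the iostat sample and the stated query rate.
MD0_KB_READ_PER_S = 26980.27   # md0 kB_read/s
MD0_TPS = 972.13               # md0 transfers per second (reads + writes)
QUERIES_PER_S = 30             # "around 30 of these queries per second"

kb_per_query = MD0_KB_READ_PER_S / QUERIES_PER_S   # data read from disk per query
ios_per_query = MD0_TPS / QUERIES_PER_S            # disk transfers per query
kb_per_io = MD0_KB_READ_PER_S / MD0_TPS            # average transfer size

print(f"{kb_per_query:.0f} kB read per query")
print(f"{ios_per_query:.1f} transfers per query")
print(f"{kb_per_io:.1f} kB per transfer")
```

Under those assumptions each query costs roughly 900 kB of disk reads spread over about 32 transfers, well above the few hundred KB one might expect for a 50-column slice.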

Much thanks!


