One more thought about Martin's suggestion: is it possible to put the data files into multiple directories located on different physical disks? That should help relieve the i/o bottleneck.
Has anybody tested the row-caching feature in trunk (targeted for 0.6?)?
I dumped 50 million records into my 2-node cluster overnight and made sure there are not many data files (only around 30), per Martin's suggestion. The data directory is 63GB. Now when I read records from the cluster, the read latency is still ~44ms, even though no writes are happening during the reads. And iostat shows that the disk (RAID10, 4 x 250GB 15k SAS) is saturated:
Device:  rrqm/s  wrqm/s     r/s    w/s    rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda       47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda1       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda2      47.67   67.67  190.33  17.00  23933.33  677.33    118.70      5.24  25.25   4.64  96.17
sda3       0.00    0.00    0.00   0.00      0.00    0.00      0.00      0.00   0.00   0.00   0.00
CPU usage is low.
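As a rough sanity check on the sda line above, here is a small Python sketch that converts the iostat fields into throughput and request size. It assumes iostat's usual 512-byte sectors, and the variable names are only illustrative.

```python
# Back-of-the-envelope numbers from the sda line of the iostat output.
# Assumption: sectors are 512 bytes, as iostat conventionally reports them.
SECTOR_BYTES = 512

header = ("rrqm/s wrqm/s r/s w/s rsec/s wsec/s "
          "avgrq-sz avgqu-sz await svctm %util").split()
values = "47.67 67.67 190.33 17.00 23933.33 677.33 118.70 5.24 25.25 4.64 96.17".split()
sda = dict(zip(header, map(float, values)))

# Throughput in MB/s (sectors per second times bytes per sector).
read_mb_per_s = sda["rsec/s"] * SECTOR_BYTES / 1e6
write_mb_per_s = sda["wsec/s"] * SECTOR_BYTES / 1e6

# Requests per second actually issued to the device.
iops = sda["r/s"] + sda["w/s"]

# Average request size in KB.
avg_request_kb = sda["avgrq-sz"] * SECTOR_BYTES / 1024

# Fraction of each second the device is busy: requests/s times
# service time per request (svctm is in milliseconds).
busy_fraction = iops * sda["svctm"] / 1000.0

print(f"read  {read_mb_per_s:.2f} MB/s")
print(f"write {write_mb_per_s:.2f} MB/s")
print(f"iops  {iops:.2f}, avg request {avg_request_kb:.2f} KB")
print(f"busy  {busy_fraction:.1%}")
```

This works out to roughly 12 MB/s of reads at about 207 requests per second, averaging about 59 KB per request, and requests per second times svctm reproduces the ~96% utilization. That pattern would be consistent with the disks being busy seeking rather than streaming, though it is only an estimate.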
Does this mean disk i/o is the bottleneck in my case? Will it help if I increase KCF to cache the entire sstable index?
Also, this is an almost read-only test. In reality, our write/read ratio is close to 1:1, so I'm guessing read latency will go even higher in that case, because it will be difficult for Cassandra to find a good moment to compact data files that are busy being written.
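To make the concern about file count concrete, here is a toy Python model (not Cassandra's actual read path): a read probes every data file that might hold part of the row, so the number of probes grows with the number of files. It deliberately ignores bloom filters, caches, and index sampling, and all function names are hypothetical.

```python
import bisect

def split_into_files(keys, n_files):
    """Spread keys over n_files toy 'data files', each a sorted list."""
    return [sorted(keys[i::n_files]) for i in range(n_files)]

def read(files, key):
    """Look the key up in every file; return (found, files_probed)."""
    probes = 0
    found = False
    for table in files:
        probes += 1
        i = bisect.bisect_left(table, key)  # binary search within one file
        if i < len(table) and table[i] == key:
            found = True  # keep going: other files could hold more of the row
    return found, probes

keys = list(range(100_000))
few = split_into_files(keys, 30)      # roughly the file count in my test
many = split_into_files(keys, 5000)   # the file count Martin saw earlier

found_few, probes_few = read(few, 4242)
found_many, probes_many = read(many, 4242)
print(probes_few, probes_many)
```

In this model the same read touches 30 files in one layout and 5000 in the other, which is the effect compaction is meant to keep in check.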
On Tue, Feb 16, 2010 at 6:06 AM, Brandon Williams <email@example.com> wrote:
On Tue, Feb 16, 2010 at 2:32 AM, Dr. Martin Grabmüller <Martin.Grabmueller@eleven.de> wrote:
In my tests I have observed that good read latency depends on keeping the number of data files low. In my current test setup, I have stored
1.9 TB of data on a single node, which is in 21 data files, and read
latency is between 10 and 60ms (for small reads; larger reads of course
take more time). In earlier stages of my test, I had up to 5000
data files, and read performance was quite bad: my configured 10-second
RPC timeout was regularly encountered.
I believe it is known that crossing sstables is O(NlogN) but I'm unable to find the ticket on this at the moment. Perhaps Stu Hood will jump in and enlighten me, but in any case I believe https://issues.apache.org/jira/browse/CASSANDRA-674 will eventually solve it.
Keeping write volume low enough that compaction can keep up is one solution, and throwing hardware at the problem is another, if necessary. Also, the row caching in trunk (soon to be 0.6 we hope) helps greatly for repeat hits.
-Brandon