cassandra-user mailing list archives

From Daniel Doubleday <>
Subject Re: Dynamic Snitch / Read Path Questions
Date Fri, 17 Dec 2010 17:09:21 GMT
> the purpose of your thread is: How far are you away from being I/O
> bound (say in terms of % utilization - last column of iostat -x 1 -
> assuming you don't have a massive RAID underneath the block device)

No, my cheap boss didn't want to buy me a stack of these.

But seriously: we don't know yet what the best option in terms of TCO is. Maybe it's worth investing
2k in SSDs if that machine could then handle the load of three.

> when compaction/AES is *not* running? I.e., how much in relative terms,
> in terms of "time spent by disks servicing requests" is added by
> compaction/AES?

Can't really say in terms of %util because we only monitor I/O waits in Zabbix. Now that our
cluster is running smoothly, I'd say compaction adds around 15-20%.
In terms of I/O waits we saw our graphs jump during compactions:

- from 20-30% to 50% under 'ok' load (requests were handled in around 100 ms max and no messages
were dropped), and
- from 50% to 80-90% during peak hours. Things got ugly then.
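For reference, a quick way to watch the %util column you mentioned is to filter the output of `iostat -x 1`; here's a minimal sketch (the device names, the 80% threshold, and the canned sample lines are illustrative only, not from our setup):

```shell
# util_watch: print any device whose %util (last column of `iostat -x`)
# exceeds the given threshold. Skips the header line.
util_watch() {
  awk -v limit="$1" '$1 != "Device" && $NF + 0 > limit {
    printf "%s saturated: %.1f%% util\n", $1, $NF }'
}

# Demo on canned iostat-style lines; in real use you would run:
#   iostat -x 1 | util_watch 80
printf 'sda 12.0 3.4 55.0\nsdb 1.0 0.2 95.5\n' | util_watch 80
```

Graphing that alongside compaction activity would show the delta directly instead of inferring it from I/O waits.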

> Are your values in generally largish (say a few kb or some such)or
> very small (5-50 bytes) or somewhere in between? I've been trying to
> collect information when people report compaction/repair killing their
> performance. My hypothesis is that most sever issues are for data sets
> where compaction becomes I/O bound rather than CPU bound (for those
> that have seen me say this a gazillion times I must be sounding like
> I'm a stuck LP record); and this would tend to be expected with larger
> and fewer values as opposed to smaller and more numerous values as the
> latter is much more expensive in terms of CPU cycles per byte
> compacted. Further I expect CPU bound compaction to be a problem very
> infrequently in comparison. I'm trying to confirm or falsify the
> hypothesis.

Well, we have 4 CFs with different characteristics, but it seems that what made things go wrong
was a CF with ~2k columns. I have never seen CPU user time over 30% on any of the nodes, so I
second your hypothesis.
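For what it's worth, a rough way to check that CPU user time stays low while compaction runs is to diff the aggregate `cpu` line of /proc/stat between two samples; a minimal sketch (the helper name and the canned sample lines are made up for illustration):

```shell
# cpu_user_pct: given two "cpu ..." lines from /proc/stat taken some
# seconds apart, print the user-time share of that interval in percent.
cpu_user_pct() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { u1 = $2; for (i = 2; i <= NF; i++) t1 += $i }
    NR == 2 { u2 = $2; for (i = 2; i <= NF; i++) t2 += $i
              printf "%.1f\n", 100 * (u2 - u1) / (t2 - t1) }'
}

# Demo with canned samples; in real use you would capture
#   grep '^cpu ' /proc/stat
# before and after a compaction interval.
s1="cpu 1000 0 500 8500"
s2="cpu 1300 0 600 9100"
cpu_user_pct "$s1" "$s2"   # 30.0
```

If that number stays well under full utilization while %util on the data disks is pegged, the compaction bottleneck is I/O rather than CPU, which matches what we saw.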

> -- 
> / Peter Schuller
