incubator-cassandra-user mailing list archives

From Edward Capriolo <edlinuxg...@gmail.com>
Subject Re: Read Latency Degradation
Date Fri, 17 Dec 2010 19:05:28 GMT
On Fri, Dec 17, 2010 at 12:26 PM, Daniel Doubleday
<daniel.doubleday@gmx.net> wrote:
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What is the hit rate of caches? high, 90%+
>
> If your heap allows it I would definitely try to give more ram for fs cache.
> You're not using row cache so I don't see what cassandra would gain from so
> much memory.
> A question about your tests:
> I assume that they run isolated (you load test one cf at a time) and the
> results are the same byte-wise?
> So the only difference is that one time you are reading from a larger file?
> Do you see the same IO load in both tests? Do you use mem-mapped io? And if
> so are the number of page faults the same in both tests?
> In the end it could just be more physical movements of the disc heads with
> larger files ...
>
> On Dec 17, 2010, at 5:46 PM, Wayne wrote:
>
> Below are some answers to your questions. We have wide rows (what we like
> about Cassandra) and I wonder if that plays into this? We have been loading
> 1 keyspace in our cluster heavily in the last week so it is behind in
> compaction for that keyspace. I am not even looking at those read latency
> times as there are as many as 100+ sstables. Compaction will run tomorrow
> for all nodes (weekend is our slow time) and I will test the read latency
> there. For the keyspace/CFs that are already well compacted we are seeing a
> steady increase in read latency as the total sstable size grows and a linear
> relationship between our different keyspaces cfs sizes and the read latency
> for reads.
>
> How many nodes? 10 - 16 cores each (2 x quad ht cpus)
> How much ram per node? 24gb
> What disks and how many? SATA 7200rpm 1x1tb for commit log, 4x1tb (raid0)
> for data
> Is your ring balanced? yes, random partitioned very evenly
> How many column families? 4 CFs x 3 Keyspaces
> How much ram is dedicated to cassandra? 12gb heap (probably too high?)
> What type of caching are you using? Key caching
> What are the sizes of caches? 500k-1m values for 2 of the CFs
> What is the hit rate of caches? high, 90%+
> What does your disk utilization|CPU|Memory look like at peak times? Disk goes
> to 90%+ under heavy read load. CPU load high as well. Latency does not
> change that much for single reads vs. under load (30 threads). We can keep
> current read latency up to 25-30 read threads if no writes or compaction is
> going on. We are worried about what we see in terms of latency for a single
> read.
> What are your average mean|max row size from cfstats? 30k avg/5meg max for
> one CF and 311k avg/855k max for the other.
> On average for a given sstable how large is the data bloom and index files?
> 30gig data, 189k filter, 5.7meg index for one CF, 98gig data, 587k filter,
> 18meg index for the other.
>
> Thanks.
>
>
>
> On Fri, Dec 17, 2010 at 10:58 AM, Edward Capriolo <edlinuxguru@gmail.com>
> wrote:
>>
>> On Fri, Dec 17, 2010 at 8:21 AM, Wayne <wav100@gmail.com> wrote:
>> > We have been testing Cassandra for 6+ months and now have 10TB in 10
>> > nodes with rf=3. It is 100% real data generated by real code in an
>> > almost production level mode. We have gotten past all our stability
>> > issues, java/cmf issues, etc. etc. now to find the one thing we
>> > "assumed" may not be true. Our current production environment is
>> > mysql with extensive partitioning. We have mysql tables with 3-4
>> > billion records and our query performance is the same as with 1
>> > million records (< 100ms).
>> >
>> > For those of us really trying to manage large volumes of data memory
>> > is not an option in any stretch of the imagination. Our current data
>> > volume once placed within Cassandra ignoring growth should be around
>> > 50 TB. We run manual compaction once a week (absolutely required to
>> > keep ss table counts down) and it is taking a very long amount of
>> > time. Now that our nodes are past 1TB I am worried it will take more
>> > than a day. I was hoping everyone would respond to my posting with
>> > something must be wrong, but instead I am hearing you are off the
>> > charts good luck and be patient. Scary to say the least given our
>> > current investment in Cassandra. Is it true/expected that read
>> > latency will get worse in a linear fashion as the ss table size
>> > grows?
>> >
>> > Can anyone talk me off the fence here? We have 9 MySQL servers that
>> > now serve up 15+TB of data. Based on what we have seen we need 100
>> > Cassandra nodes with rf=3 to give us good read latency (by keeping
>> > the node data sizes down). The cost/value equation just does not add
>> > up.
>> >
>> > Thanks in advance for any advice/experience you can provide.
>> >
>> >
>> > On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday
>> > <daniel.doubleday@gmx.net>
>> > wrote:
>> >>
>> >> On Dec 16, 2010, at 11:35 PM, Wayne wrote:
>> >>
>> >> > I have read that read latency goes up with the total data size,
>> >> > but to what degree should we expect a degradation in performance?
>> >> > What is the "normal" read latency range if there is such a thing
>> >> > for a small slice of scol/cols? Can we really put 2TB of data on a
>> >> > node and get good read latency querying data off of a handful of
>> >> > CFs? Any experience or explanations would be greatly appreciated.
>> >>
>> >> If you really mean 2TB per node I strongly advise you to perform
>> >> thorough testing with real world column sizes and the read write
>> >> load you expect. Try to load test at least with a test cluster /
>> >> data that represents one replication group. I.e. RF=3 -> 3 nodes.
>> >> And test with the consistency level you want to use. Also test ring
>> >> operations (repair, adding nodes, moving nodes) while under expected
>> >> load.
>> >>
>> >> Combined with 'a handful of CFs' I would assume that you are
>> >> expecting a considerable write load. You will get massive compaction
>> >> load and with that data size the file system cache will suffer big
>> >> time. You'll need loads of RAM and still ...
>> >>
>> >> I can only speak about 0.6 but ring management operations will become a
>> >> nightmare and you will have very long running repairs.
>> >>
>> >> The cluster behavior changes massively with different access
>> >> patterns (cold vs warm data) and data sizes. So you have to
>> >> understand yours and test it. I think most generic load tests are
>> >> mainly marketing instruments and I believe this is especially true
>> >> for cassandra.
>> >>
>> >> Don't want to sound negative (I am a believer and don't regret our
>> >> investment) but cassandra is no silver bullet. You really need to
>> >> know what you are doing.
>> >>
>> >> Cheers,
>> >> Daniel
>> >
>>
>> Yes major compactions for large sets of data do take a long time
>> (360GB takes me about 6 hours).
>>
>> You said "needing to compact to keep the sstable count low". This is
>> not a good sign. My sstable counts sawtooth between 8-15 per CF
>> through the day. If you are in a scenario where the SSTables are
>> growing all day and only catch up at night, and you have tuned
>> memtables, then you likely need more nodes. This means that your
>> cluster cannot really keep up with your write traffic. You know
>> cassandra can take bursts of writes well, but if you are at the point
>> where your sstable count keeps getting higher you are essentially
>> falling behind. (You may not need 100 nodes like you are suggesting
>> but possibly a few to get you over the fence.)
>>
>> I do run major compactions at night, but not every night on every
>> node. I do one node a night to make sure these are splayed out over
>> the week. With deletes on non-major compactions you may not need to do
>> this, but we add and remove a lot of data per day so I find I have
>> to/should, since the nights are quiet for us anyway.
>>
>> As for how many nodes you need...What works out better ?
>> Big Iron: 1x (2 TB 64 GB RAM ) cost ? power ? Rack size ?
>> Small factor: 4x (500GB  16GB RAM) cost ? power ? Rack Size ?
>> Generally I think most are running the "small factor" type deployment,
>> and generally this works better by avoiding 2GB compactions!
>>
>> Is it true that read latency grows linearly with sstable size? No (but
>> it could be true in your case).
>>
>> As for your specific problems, more info is needed.
>>
>> How many nodes?
>> How much ram per node?
>> What disks and how many?
>> Is your ring balanced?
>> How many column families?
>> How much ram is dedicated to cassandra?
>> What type of caching are you using?
>> What are the sizes of caches?
>> What is the hit rate of caches?
>> What does your disk utilization|CPU|Memory look like at peak times?
>> What are your average mean|max row size from cfstats?
>> On average for a given sstable how large is the data bloom and index
>> files?
>
>
>
I +1 many of Daniel's points.

Your setup is pretty good. I like having 24 GB RAM with 12 GB used for
the JVM; that is the general suggestion. You are getting a good hit
rate, but maybe you could get about the same hit rate with a smaller
cache, leaving more memory for the VFS cache.

If possible try to get the bulk loading done in off hours. That
SSTable build-up is not a good sign. It likely means that you are in
compaction mode most of the time. You could try disabling compaction
during your bulk loads. That is a suggested tune, but if your bulk load
takes a long time and compaction is off then you get that same SSTable
build-up, so it is a wash.

I see a couple of things:
Your storage to RAM ratio is high. I know RAM is not cheap, but if you
have some lying around, try bumping a single machine up to 32GB or
higher. Do not change the cache/JVM settings, just add more RAM for
VFS cache and see what type of improvement you get on that node. Every
use case is different, but I recently saw another NoSQL presentation
(not cassandra) that was using 128GB RAM to manage 1TB data/node! It
would be interesting to hear what other people are doing with respect
to memory/disk size ratio!
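
To put rough numbers on that ratio, here is a small Python sketch. It
is only an illustration: the helper name cache_ratio is made up, the
~1TB of data per node comes from Wayne's "nodes are past 1TB", and
treating "RAM minus JVM heap" as the memory available to the VFS cache
is a simplifying assumption (the OS and other processes also take a
share).

```python
# Rough estimate of how much of a node's on-disk data could sit in the
# OS file cache. Figures are taken from the thread; the helper itself
# and the "RAM minus heap" approximation are assumptions.

def cache_ratio(ram_gb, heap_gb, data_gb):
    """Fraction of on-disk data that could fit in the file system cache."""
    free_for_cache = max(ram_gb - heap_gb, 0)
    return free_for_cache / data_gb

# Current setup: 24 GB RAM, 12 GB heap, roughly 1 TB of data per node.
current = cache_ratio(24, 12, 1000)

# Suggested experiment: same heap, one machine bumped to 32 GB RAM.
bumped = cache_ratio(32, 12, 1000)

print(f"current: {current:.1%} of data cacheable")
print(f"bumped:  {bumped:.1%} of data cacheable")
```

Under these assumptions only a percent or two of the data fits in the
cache either way, so how much the extra RAM helps will depend on how
concentrated the reads are on a hot subset of the data.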

I know your drive configuration is another thing that is hard to
change, but 7200 RPM drives leave a lot to be desired in terms of
seek time. Even with a great cache hit rate you are going to have to
move around that RAID0. Have you tested the RAID card and your RAID
setup with iozone, bonnie++, etc.?

Try iostat -kd 5 and take the second or third sample (the first one
reports averages since boot). We know your disk utilization is high,
but what is that disk capable of?
For reference, this is my RAID setup at ~81% utilization:
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda             550.60     21497.60      2240.80     107488      11204
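
One way to read a sample like this is to derive the average transfer
size and the read share from it. The Python sketch below does that for
the line above. The function name parse_iostat_line and the dictionary
keys are hypothetical, and it assumes the five-column iostat -kd layout
shown in the header.

```python
# Derive a few figures from one device line of `iostat -kd` output.
# Column order is assumed to match the header above:
# Device, tps, kB_read/s, kB_wrtn/s, kB_read, kB_wrtn.

def parse_iostat_line(line):
    """Split one device line into a dict of named values."""
    fields = line.split()
    return {
        "device": fields[0],
        "tps": float(fields[1]),
        "kB_read_s": float(fields[2]),
        "kB_wrtn_s": float(fields[3]),
    }

sample = "sda             550.60     21497.60      2240.80     107488      11204"
row = parse_iostat_line(sample)

total_kb_s = row["kB_read_s"] + row["kB_wrtn_s"]

# Average size of one I/O request issued to the device, in kB.
avg_kb_per_transfer = total_kb_s / row["tps"]

# Fraction of the throughput that is reads.
read_share = row["kB_read_s"] / total_kb_s

print(f"{row['device']}: {avg_kb_per_transfer:.1f} kB per transfer, "
      f"{read_share:.0%} reads")
```

Running the same calculation on your own second or third sample gives a
like-for-like comparison of transfers per second and throughput at a
similar utilization.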
