incubator-cassandra-user mailing list archives

From Jon Scarborough <...@fifth-aeon.net>
Subject Re: High disk I/O during reads
Date Mon, 25 Mar 2013 17:15:22 GMT
Big thanks to everyone for the suggestions.  Turned out to be a good
opportunity for me to learn more about C* internals and add some extra
debug output to the code to validate some of my assumptions.

Aaron, very good points on the problems with doing the reversed queries.  I
read your blog posts over the weekend, and they're the best description
I've found so far on how the read-path in C* works - they made the code
much easier to understand.  Even though the reversed queries turned out not
to be the primary issue, there are some easy adjustments we can make to
speed things up.  Your guess was also correct that we're running m1.large
instances.

Anyway, turns out that Igor's instincts were correct, and the root cause
was a disk read-ahead issue.  On the 4 nodes that were performing poorly it
was set way too high.  With that fix in place we're now seeing read
latencies in the 10ms ballpark, which I think is reasonable for our
hardware setup and use-case.
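A sketch of how read-ahead can be inspected and lowered on Linux. The device names match the iostat output quoted below in this thread; the values shown are illustrative, not the thread's actual settings:

```shell
# Read-ahead is reported in 512-byte sectors.
blockdev --getra /dev/xvdb   # e.g. 65536 sectors would mean 32 MB per read-ahead
blockdev --getra /dev/md0

# Lower it to something sane for a random-read workload,
# e.g. 128 sectors = 64 KB, on each data device:
blockdev --setra 128 /dev/xvdb
blockdev --setra 128 /dev/xvdc
blockdev --setra 128 /dev/md0
```

Note that the setting does not persist across reboots; it needs to go in an init script or udev rule.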

-Jon

On Sun, Mar 24, 2013 at 10:27 AM, aaron morton <aaron@thelastpickle.com> wrote:

> Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
> xvdap1            0.13         0.00         1.07          0         16
> xvdb            474.20     13524.53        25.33     202868        380
> xvdc            469.87     13455.73        30.40     201836        456
>
> Perchance are you running on m1.large instances ?
> You may be seeing the "moderate" IO performance they offer.
>
>
> The query is looking for a maximum of 50 messages between two dates, in
> reverse order.
>
> This is probably the slowest query you can do. Removing the reverse would
> make it faster, as would using a reversed comparator.
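For reference, a reversed comparator is declared when the column family is created. This is a sketch only, reusing the comparator classes from the schema posted later in this thread, with the leading DateType wrapped in ReversedType so columns sort newest-first on disk:

```
CREATE COLUMNFAMILY conversation_text_message (
  conversation_key bigint PRIMARY KEY
) WITH comparator='CompositeType(org.apache.cassandra.db.marshal.ReversedType(org.apache.cassandra.db.marshal.DateType),org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)';
```

A comparator cannot be changed on an existing column family, so this would mean creating a new CF and migrating the data.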
> You mentioned that most queries in cfhistograms were hitting one SSTable.
> If you are comfortable with that output, I would work out the distribution
> for read latency; note that the read latency in cfstats covers only the
> most recent interval.
>
> Having a lot of deletes on a row can slow down reads as well. It does not
> sound like the case here, but I wanted to mention it.
>
> The next thing to look at is proxyhistograms, which will give you the
> end-to-end request latency in Cassandra. That will help identify whether
> the issue is the disk read or network / something else. The something
> else may be DigestMismatch errors if you have been dropping writes and not
> running repair / have disabled hints.
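The comparison Aaron describes can be run on each node like this (the keyspace name is a placeholder to substitute):

```shell
KEYSPACE=your_keyspace   # placeholder: substitute your keyspace name

# Per-CF read latency and SSTables-per-read distribution:
nodetool -h localhost cfhistograms "$KEYSPACE" conversation_text_message

# End-to-end request latency as seen by the coordinator:
nodetool -h localhost proxyhistograms
```

If proxyhistograms is much worse than cfhistograms, the time is going into coordination (network, digest mismatches, read repair) rather than the local disk read.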
>
> Cheers
>
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 24/03/2013, at 2:59 PM, Matt Kap <matvey1414@gmail.com> wrote:
>
> Having battled similar issues with read latency recently, here's some
> general things to look out for.
>
> - At 118ms, something is definitely broken. You should be seeing 10ms or
> lower, depending on hardware.
> - Do "nodetool info" on all 5 nodes. Is the load distributed evenly?
> Is it reasonable (under 500GB)?
> - Make sure you aren't running low on heap space. You could see that
> from "nodetool info" also. If you are running low, very bad things
> begin to happen (lots of GC, constant flushing of Memtables, reduction
> of Key Cache, etc). Generally, once there, the node doesn't recover,
> and read latency goes to sh*t.
> - Which compaction strategy are you using? Leveled compactions or
> size-tiered? There's different issues with both.
> - Is your Key Cache turned on? What's the Key Cache hit rate?
> - Is the Read Latency the same on all nodes? Or just one in particular?
> - Are pending compactions building up?
> - What's %util on disk? Same on all nodes?
>
> I would go through "nodetool cfstats, info, compactionstats, tpstats",
> and see if things are roughly the same across all the nodes. You could
> also just be under capacity, but more likely, there's an actual
> problem looming somewhere.
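One way to script that cross-node comparison (the hostnames are placeholders; the grep patterns assume typical nodetool output):

```shell
# Run the same diagnostics side by side on every node.
for NODE in node1 node2 node3 node4 node5; do   # placeholder hostnames
  echo "== $NODE =="
  nodetool -h "$NODE" info | grep -E 'Load|Heap'     # data size and heap usage
  nodetool -h "$NODE" tpstats | grep -i dropped      # dropped messages
  nodetool -h "$NODE" compactionstats | head -n 2    # pending compactions
done
```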
>
> Cheers!
> -Matt
>
> On Sat, Mar 23, 2013 at 3:18 AM,  <igor@4friends.od.ua> wrote:
>
> You can try to disable readahead on cassandra data disk.
>
> Jon Scarborough <jon@fifth-aeon.net> wrote:
>
>
> Checked tpstats, there are very few dropped messages.
>
> Checked histograms. Mostly nothing surprising. The vast majority of rows
> are small, and most reads only access one or two SSTables.
>
> What I did discover is that of our 5 nodes, one is performing well, with
> disk I/O in the ballpark that seems reasonable. The other 4 nodes are doing
> roughly 4x the disk I/O per second.  Interestingly, the node that is
> performing well also seems to be servicing about twice the number of reads
> that the other nodes are.
>
> I compared configuration between the node performing well to those that
> aren't, and so far haven't found any discrepancies.
>
> On Fri, Mar 22, 2013 at 10:43 AM, Wei Zhu <wz1975@yahoo.com> wrote:
>
>
> According to your cfstats, read latency is over 100 ms, which is really,
> really slow. I am seeing less than 3ms reads for my cluster, which is on
> SSD.
> Can you also check nodetool cfhistograms? It tells you more about the
> number of SSTables involved and read/write latency. Sometimes the average
> doesn't tell you the whole story.
> Also check your nodetool tpstats: are there a lot of dropped reads?
>
> -Wei
> ----- Original Message -----
> From: "Jon Scarborough" <jon@fifth-aeon.net>
> To: user@cassandra.apache.org
> Sent: Friday, March 22, 2013 9:42:34 AM
> Subject: Re: High disk I/O during reads
>
> Key distribution across SSTables probably varies a lot from row to row in
> our case. Most reads would probably only need to look at a few SSTables; a
> few might need to look at more.
>
> I don't yet have a deep understanding of C* internals, but I would
> imagine even the more expensive use cases would involve something like
> this:
>
> 1) Check the index for each SSTable to determine if part of the row is
> there.
> 2) Look at the endpoints of the slice to determine if the data in a
> particular SSTable is relevant to the query.
> 3) Read the chunks of those SSTables, working backwards from the end of
> the slice until enough columns have been read to satisfy the limit clause
> in the query.
>
> So I would have guessed that even the more expensive queries on wide rows
> typically wouldn't need to read more than a few hundred KB from disk to do
> all that. Seems like I'm missing something major.
>
> Here's the complete CF definition, including compression settings:
>
> CREATE COLUMNFAMILY conversation_text_message (
>   conversation_key bigint PRIMARY KEY
> ) WITH
>   comment='' AND
>   comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND
>   read_repair_chance=0.100000 AND
>   gc_grace_seconds=864000 AND
>   default_validation=text AND
>   min_compaction_threshold=4 AND
>   max_compaction_threshold=32 AND
>   replicate_on_write=True AND
>   compaction_strategy_class='SizeTieredCompactionStrategy' AND
>   compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
>
> Much thanks for any additional ideas.
>
> -Jon
>
>
>
On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean <Dean.Hiller@nrel.gov> wrote:
>
>
> Did you mean to ask "are 'all' your keys spread across all SSTables"? I
> am guessing at your intention.
>
> I mean, I would very well hope my keys are spread across all sstables, or
> otherwise that sstable should not be there, as it has no keys in it ;).
>
> And I know we had HUGE disk usage from the duplication in our sstables on
> size-tiered compaction. We never ran a major compaction, but after we
> switched to LCS we went from 300G to some 120G or something like that,
> which was nice. We only have 300 data point posts / second, so not an
> extreme write load on 6 nodes, though these posts cause reads to check
> authorization and such in our system.
>
> Dean
>
> From: Kanwar Sangha <kanwar@mavenir.com>
> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Date: Friday, March 22, 2013 8:38 AM
> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
> Subject: RE: High disk I/O during reads
>
>
> Are your keys spread across all SSTables? That will cause every SSTable to
> be read, which will increase the I/O.
>
> What compaction are you using ?
>
> From: zodiak@fifth-aeon.net [mailto:zodiak@fifth-aeon.net] On Behalf Of
> Jon Scarborough
> Sent: 21 March 2013 23:00
> To: user@cassandra.apache.org
>
>
> Subject: High disk I/O during reads
>
> Hello,
>
> We've had a 5-node C* cluster (version 1.1.0) running for several months.
> Up until now we've mostly been writing data, but now we're starting to
> service more read traffic. We're seeing far more disk I/O to service these
> reads than I would have anticipated.
>
> The CF being queried consists of chat messages. Each row represents a
> conversation between two people. Each column represents a message. The
> column key is composite, consisting of the message date and a few other
> bits
> of information. The CF is using compression.
>
> The query is looking for a maximum of 50 messages between two dates, in
> reverse order. Usually the two dates used as endpoints are 30 days ago and
> the current time. The query in Astyanax looks like this:
>
> ColumnList<ConversationTextMessageKey> result =
>     keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE)
>         .setConsistencyLevel(ConsistencyLevel.CL_QUORUM)
>         .getKey(conversationKey)
>         .withColumnRange(
>             textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(),
>             textMessageSerializer.makeEndpoint(startDate, Equality.GREATER_THAN_EQUALS).toBytes(),
>             true,
>             maxMessages)
>         .execute()
>         .getResult();
>
> We're currently servicing around 30 of these queries per second.
>
> Here's what the cfstats for the CF look like:
>
> Column Family: conversation_text_message
> SSTable count: 15
> Space used (live): 211762982685
> Space used (total): 211762982685
> Number of Keys (estimate): 330118528
> Memtable Columns Count: 68063
> Memtable Data Size: 53093938
> Memtable Switch Count: 9743
> Read Count: 4313344
> Read Latency: 118.831 ms.
> Write Count: 817876950
> Write Latency: 0.023 ms.
> Pending Tasks: 0
> Bloom Filter False Postives: 6055
> Bloom Filter False Ratio: 0.00260
> Bloom Filter Space Used: 686266048
> Compacted row minimum size: 87
> Compacted row maximum size: 14530764
> Compacted row mean size: 1186
>
> On the C* nodes, iostat output like this is typical, and can spike to be
> much worse:
>
> avg-cpu: %user %nice %system %iowait %steal %idle
> 1.91 0.00 2.08 30.66 0.50 64.84
>
> Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
> xvdap1 0.13 0.00 1.07 0 16
> xvdb 474.20 13524.53 25.33 202868 380
> xvdc 469.87 13455.73 30.40 201836 456
> md0 972.13 26980.27 55.73 404704 836
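One rough way to read that iostat sample: kB_read/s divided by tps gives the average data moved per I/O (approximate, since tps also includes the small write load):

```shell
# xvdb: 13524.53 kB/s over 474.20 transfers/s is roughly 28.5 kB per I/O,
# versus a compacted mean row size of ~1.2 kB from the cfstats above;
# a large gap worth investigating (read-ahead turned out to be the cause
# at the top of this thread).
awk 'BEGIN { printf "%.1f\n", 13524.53 / 474.20 }'
```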
>
> Any thoughts on what could be causing read I/O to the disk from these
> queries?
>
> Much thanks!
>
> -Jon
>
>
>
>
> --
> Sent via K-9 Mail. Please excuse my brevity.
>
>
>
>
> --
> www.calcmachine.com - easy online calculator.
>
>
>
