cassandra-user mailing list archives

From Joseph Tech <jaalex.t...@gmail.com>
Subject Re: Read timeouts on primary key queries
Date Mon, 05 Sep 2016 13:10:31 GMT
Thanks, Romain. We will try enabling DEBUG logging (assuming it won't
clog the logs much). Regarding the table configs, read_repair_chance must
have been carried over from older versions; the settings are mostly
defaults. I think sstable_size_in_mb was set to limit the max SSTable size,
though I am not sure of the reason for the 50 MB value.

Does setting dclocal_read_repair_chance help reduce cross-DC traffic?
(I haven't looked into this parameter; I'm just going by the name.)
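
As a rough illustration of this question, here is a minimal Python sketch of how a coordinator in the 2.x line is understood to pick the scope of read repair for each read. The function names and string labels are hypothetical and exist only for this sketch; it models the behaviour and is not Cassandra's code.

```python
import random


def read_repair_decision(read_repair_chance, dclocal_read_repair_chance, roll=None):
    """Model the per-read choice of read repair scope (illustrative only).

    "GLOBAL"   -> replicas in every data center are involved (cross-DC traffic)
    "DC_LOCAL" -> only replicas in the coordinator's data center are involved
    "NONE"     -> no extra read repair for this read
    """
    # Assumption: one random roll in [0, 1) is compared against both settings,
    # with the global chance checked first.
    if roll is None:
        roll = random.random()
    if read_repair_chance > roll:
        return "GLOBAL"
    if dclocal_read_repair_chance > roll:
        return "DC_LOCAL"
    return "NONE"


def global_fraction(read_repair_chance, dclocal_read_repair_chance,
                    samples=100_000, seed=42):
    """Estimate the share of reads that would trigger a global read repair."""
    rng = random.Random(seed)
    hits = sum(
        read_repair_decision(read_repair_chance,
                             dclocal_read_repair_chance,
                             rng.random()) == "GLOBAL"
        for _ in range(samples)
    )
    return hits / samples
```

In this model, the table's current settings (read_repair_chance = 0.1, dclocal_read_repair_chance = 0.0) send roughly one read in ten down the global path, while swapping the two values keeps the same repair rate but confines it to the local DC.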

Regarding the cell count definition: is it incremented for every write to
a given name (key?) and value? This table is heavy on both reads and
writes, so if that were the case, shouldn't the value be much higher?
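
To make the cell count question concrete, here is a small Python sketch. It assumes that cells with the same name are reconciled by timestamp (last write wins) once they are merged, so repeated writes to one cell do not add up. The `Cell` type and `merge_cells` helper are hypothetical names for this illustration, and the sketch ignores tombstones, TTLs, row markers, and timestamp ties.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    name: str
    value: str
    timestamp: int


def merge_cells(writes):
    """Keep only the newest version of each cell name (last write wins)."""
    latest = {}
    for cell in writes:
        current = latest.get(cell.name)
        # Assumption: a strictly newer timestamp replaces the stored cell.
        if current is None or cell.timestamp > current.timestamp:
            latest[cell.name] = cell
    return list(latest.values())


# Four writes to one partition, three of them to the same cell name.
writes = [
    Cell("f1", "a", 1),
    Cell("f1", "b", 2),
    Cell("updated", "2016-09-01", 2),
    Cell("f1", "c", 3),
]
merged = merge_cells(writes)
cell_count = len(merged)
```

In this model, four writes leave two cells, which is why a write-heavy table need not show a large cell count if the writes mostly overwrite the same cells. Until compaction merges them, older versions of a cell may still exist in other SSTables.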

On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh_ml@yahoo.fr> wrote:

> Hi,
>
> Try setting org.apache.cassandra.db.ConsistencyLevel to DEBUG level; it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
>     read_repair_chance = 0.1
> And not the local read repair:
>     dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> On Monday, 5 September 2016 at 13:32, Joseph Tech <jaalex.tech@gmail.com>
> wrote:
>
>
> Hi Ryan,
>
> Attached are the cfhistograms, run within a few minutes of each other. On
> the surface, I don't see anything that indicates too much skewing (assuming
> skewing == keys spread across many SSTables). Please confirm. Related to
> this, what does the "cell count" metric indicate? I didn't find a clear
> explanation in the documentation.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms? Your data may just be
> skewed (the most likely explanation is probably the correct one here).
>
> Regards,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <jaalex.tech@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <user@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? We can't enable tracing in production given the
> volume of traffic. We see that the same query that timed out works fine
> later, so I'm not sure the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfadin@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard-to-find issues. You
> should be looking for a clear ledge where you go from single-digit ms to
> 4- or 5-digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.tech@gmail.com>
> wrote:
>
> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but I didn't see anything standing
> out in the GC logs (using JDK 1.8_92 with G1GC).
>
> @Patrick, could you please elaborate on the "mismatch on node count + RF"
> part?
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.saroha@snapdeal.com>
> wrote:
>
> There could be many reasons for this if it is intermittent. Check CPU
> usage and I/O wait. Since reads are I/O intensive, your IOPS capacity must
> meet the load at that time. It could be a heap issue if the CPU is busy
> only with GC. Network health could also be the reason. So it is better to
> look at system health at the time the timeouts occur.
>
> ------------------------------------------------------------
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.tech@gmail.com>
> wrote:
>
> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. In the OpsCenter
> "Event Log", some nodes are reported as going down/up during the
> timeframe of the timeouts, but these are Search workload nodes from the
> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfadin@gmail.com>
> wrote:
>
> You aren't achieving quorum on your reads, as the error explains. That
> means you either have some nodes down or your topology is not matching up.
> The fact that you are using LOCAL_QUORUM might point to a datacenter
> mismatch on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.tech@gmail.com>
> wrote:
>
> Hi,
>
> We recently started getting intermittent timeouts on primary key queries
> (select * from table where key=<key>)
>
> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica responded)
>
> The same query would work fine when tried directly from cqlsh. There are
> no indications in system.log for the table in question, though there were
> compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chance of primary key queries timing out is
> minimal. Please share the possible reasons for this and ways to debug it.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
