cassandra-user mailing list archives

From Joseph Tech <jaalex.t...@gmail.com>
Subject Re: Read timeouts on primary key queries
Date Mon, 03 Oct 2016 13:38:00 GMT
Hi All,

Managed to capture the trace for a timed-out query using probabilistic
tracing (attached). It seems the timeout is caused by a DigestMismatchException
triggering a global read repair. I guess this is due
to dclocal_read_repair_chance = 0.0 for the table. Can someone please
confirm this, and whether setting dclocal_read_repair_chance = 0.1 would prevent it?
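
If that is the right fix, the change I have in mind is along these lines
(just a sketch run via the Java driver against the table described earlier in
this thread; the contact point is a placeholder and we could equally apply it
from cqlsh):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    // Sketch only: swap the global read repair chance for a DC-local one on
    // the affected table, so digest mismatches trigger read repair within
    // the local DC instead of a cross-DC (global) read repair.
    Cluster cluster = Cluster.builder()
                             .addContactPoint("127.0.0.1")  // placeholder
                             .build();
    Session session = cluster.connect();
    session.execute("ALTER TABLE db.tbl"
        + " WITH dclocal_read_repair_chance = 0.1"
        + " AND read_repair_chance = 0.0");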

Thanks,
Joseph



On Thu, Sep 15, 2016 at 6:52 PM, Joseph Tech <jaalex.tech@gmail.com> wrote:

> I added the error logs and see that the timeouts range between 2 and
> 7s. Samples below:
>
>  Query error after 5354 ms: [4 bound values] <query>
>  Query error after 6658 ms: [4 bound values] <query>
>  Query error after 4596 ms: [4 bound values] <query>
>  Query error after 2068 ms: [4 bound values] <query>
>  Query error after 2904 ms: [4 bound values] <query>
>
> There is no specific socket timeout set on the client side, so it would
> take the default of 12s. read_request_timeout_in_ms is set to 5s. In
> this case, how do the errors happen in under 5s? Is there any other factor
> that would cause a fail-fast scenario during the read?
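>
> In case it helps, this is roughly how we could pin the driver-side read
> timeout explicitly instead of relying on the default (a sketch only, not
> what we currently do; the contact point and values are illustrative):
>
>     import com.datastax.driver.core.Cluster;
>     import com.datastax.driver.core.SocketOptions;
>
>     // Sketch: set the driver's per-request read timeout explicitly, keeping
>     // it above the server's read_request_timeout_in_ms (5s) so the
>     // coordinator times out first and the client sees the server-side error.
>     SocketOptions socketOptions = new SocketOptions()
>         .setReadTimeoutMillis(12000);      // driver default is also 12s
>     Cluster cluster = Cluster.builder()
>         .addContactPoint("127.0.0.1")      // placeholder contact point
>         .withSocketOptions(socketOptions)
>         .build();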
>
> Thanks,
> Joseph
>
>
>
>
> On Wed, Sep 7, 2016 at 5:26 PM, Joseph Tech <jaalex.tech@gmail.com> wrote:
>
>> Thanks, Romain, for the detailed explanation. We use Log4j 2 and I have
>> added the driver logging for slow/error queries; will see if it helps to
>> reveal any pattern once in prod.
>>
>> I tried getendpoints and getsstables for some of the timed-out keys and
>> most of them listed only 1 SSTable. There were a few which showed 2
>> SSTables. There is no specific trend in the keys; it's completely based on
>> user access, and the same keys return results instantly from cqlsh.
>>
>> On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <romainh_ml@yahoo.fr>
>> wrote:
>>
>>> There is nothing special in the two sstablemetadata outputs, but if the
>>> timeouts are due to a network split, an overwhelmed node, or something like
>>> that, you won't see anything here. That said, if you have the keys which
>>> produced the timeouts then, yes, you can look for a regular pattern (i.e.
>>> always the same keys?).
>>>
>>> You can find sstables for a given key with nodetool:
>>>     nodetool getendpoints <keyspace> <cf> <key>
>>> Then you can run the following command on one/each node of the endpoints:
>>>     nodetool getsstables <keyspace> <cf> <key>
>>>
>>> If many SSTables are shown by the previous command, it means that your
>>> data is fragmented, but thanks to LCS this number should be low.
>>>
>>> I think the most useful actions now would be:
>>>
>>> *1)* Enable DEBUG for o.a.c.db.ConsistencyLevel; it won't spam your
>>> log file, and you will see the following when errors occur:
>>>     - Local replicas [<endpoint1>, ...] are insufficient to satisfy
>>> LOCAL_QUORUM requirement of X live nodes in '<dc>'
>>>
>>>     You are using C* 2.1 but you can have a look at the C* 2.2
>>> logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml
>>>     I'm using it in production; it's better because it creates a
>>> separate debug.log file with an asynchronous appender.
>>>
>>>    Watch out when enabling:
>>>
>>>     <appender-ref ref="ASYNCDEBUGLOG" />
>>>
>>>    Because the default logback configuration sets all of o.a.c to DEBUG:
>>>
>>>         <logger name="org.apache.cassandra" level="DEBUG"/>
>>>
>>>    Instead you can set:
>>>
>>>    <logger name="org.apache.cassandra" level="INFO"/>
>>>    <logger name="org.apache.cassandra.db.ConsistencyLevel"
>>> level="DEBUG"/>
>>>
>>>     Also, if you want to restrict debug.log to DEBUG level only (instead
>>> of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in
>>> logback.xml:
>>>
>>>     <filter class="ch.qos.logback.classic.filter.LevelFilter">
>>>       <level>DEBUG</level>
>>>       <onMatch>ACCEPT</onMatch>
>>>       <onMismatch>DENY</onMismatch>
>>>     </filter>
>>>
>>>   Thus, the debug.log file will be empty unless consistency issues
>>> occur.
>>>
>>> *2)* Enable slow query logging at the driver level with a QueryLogger:
>>>
>>>    Cluster cluster = ...
>>>    // log queries longer than 1 second, see also withDynamicThreshold
>>>    QueryLogger queryLogger = QueryLogger.builder(cluster)
>>>                                         .withConstantThreshold(1000)
>>>                                         .build();
>>>    cluster.register(queryLogger);
>>>
>>>     Then in your driver logback file:
>>>
>>>         <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG" />
>>>
>>>  *3)* And/or, since you mentioned that you use DSE, you can enable slow
>>> query logging in dse.yaml (cql_slow_log_options).
>>>
>>> Best,
>>>
>>> Romain
>>>
>>>
>>> Le Lundi 5 septembre 2016 20h05, Joseph Tech <jaalex.tech@gmail.com> a
>>> écrit :
>>>
>>>
>>> Attached are the sstablemetadata outputs from 2 SSTables of size 28 MB and
>>> 52 MB (out2). The records are inserted with different TTLs based on their
>>> nature: test records with 1 day, typeA records with 6 months, typeB
>>> records with 1 year, etc. There are also explicit DELETEs from this table,
>>> though at a much lower rate than the inserts.
>>>
>>> I am not sure how to interpret this output, or whether the right SSTables
>>> were picked. Please advise. Is there a way to get the
>>> SSTables corresponding to the keys that timed out, given that they are
>>> accessible later?
>>>
>>> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <
>>> anshu.vajpayee@gmail.com> wrote:
>>>
>>> We have seen read timeout issues in Cassandra due to a high droppable
>>> tombstone ratio on a repository.
>>>
>>> Please check for a high droppable tombstone ratio on your repo.
>>>
>>> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <romainh_ml@yahoo.fr>
>>> wrote:
>>>
>>> Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and
>>> latency, so you can swap the values
>>> (https://issues.apache.org/jira/browse/CASSANDRA-7320). I guess
>>> sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the
>>> default size was way too small: 5 MB. So maybe someone in your company
>>> tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I'm not
>>> saying you should change the value, but just keep in mind that you're using
>>> a small value here; it could help you someday.
>>>
>>> Regarding the cells, the histograms show an *estimation* of the min,
>>> p50, ..., p99, and max cell counts based on SSTable metadata. On your
>>> screenshot, the max is 4768, so you have a partition key with ~4768 cells.
>>> The p99 is 1109, so 99% of your partition keys have less than (or equal to)
>>> 1109 cells.
>>> You can see these data for a given SSTable with the sstablemetadata tool.
>>>
>>> Best,
>>>
>>> Romain
>>>
>>>
>>>
>>> Le Lundi 5 septembre 2016 15h17, Joseph Tech <jaalex.tech@gmail.com> a
>>> écrit :
>>>
>>>
>>> Thanks, Romain. We will try to enable the DEBUG logging (assuming it
>>> won't clog the logs much). Regarding the table configs, read_repair_chance
>>> must be carried over from older versions - mostly defaults. I think
>>> sstable_size_in_mb was set to limit the max SSTable size, though I am not
>>> sure of the reason for the 50 MB value.
>>>
>>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>>> traffic? (I haven't looked into this parameter, just going by the name.)
>>>
>>> On the cell count definition: is it incremented based on the number of
>>> writes for a given name (key?) and value? This table is heavy on reads and
>>> writes; if so, shouldn't the value be much higher?
>>>
>>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <romainh_ml@yahoo.fr>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level; it
>>> could help to find a regular pattern. By the way, I see that you have set a
>>> global read repair chance:
>>>     read_repair_chance = 0.1
>>> And not the local read repair:
>>>     dclocal_read_repair_chance = 0.0
>>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>>> configuration?
>>>
>>> The cell count is the number of triplets: (name, value, timestamp)
>>>
>>> Also, I see that you have set sstable_size_in_mb to 50 MB. What is the
>>> rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per
>>> read" are good.
>>>
>>> Best,
>>>
>>> Romain
>>>
>>> Le Lundi 5 septembre 2016 13h32, Joseph Tech <jaalex.tech@gmail.com> a
>>> écrit :
>>>
>>>
>>> Hi Ryan,
>>>
>>> Attached are the cfhistograms, run within a few minutes of each other. On
>>> the surface, I don't see anything which indicates too much skewing (assuming
>>> skewing == keys spread across many SSTables). Please confirm. Related to
>>> this, what does the "cell count" metric indicate? I didn't find a clear
>>> explanation in the documentation.
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs@foundev.pro> wrote:
>>>
>>> Have you looked at cfhistograms/tablehistograms? Your data may just be
>>> skewed (the most likely explanation is probably the correct one here).
>>>
>>> Regards,
>>>
>>> Ryan Svihla
>>>
>>> _____________________________
>>> From: Joseph Tech <jaalex.tech@gmail.com>
>>> Sent: Wednesday, August 31, 2016 11:16 PM
>>> Subject: Re: Read timeouts on primary key queries
>>> To: <user@cassandra.apache.org>
>>>
>>>
>>>
>>> Patrick,
>>>
>>> The desc table is below (only col names changed) :
>>>
>>> CREATE TABLE db.tbl (
>>>     id1 text,
>>>     id2 text,
>>>     id3 text,
>>>     id4 text,
>>>     f1 text,
>>>     f2 map<text, text>,
>>>     f3 map<text, text>,
>>>     created timestamp,
>>>     updated timestamp,
>>>     PRIMARY KEY (id1, id2, id3, id4)
>>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>>>     AND bloom_filter_fp_chance = 0.01
>>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>>     AND comment = ''
>>>     AND compaction = {'sstable_size_in_mb': '50', 'class':
>>> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>>>     AND compression = {'sstable_compression':
>>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>     AND dclocal_read_repair_chance = 0.0
>>>     AND default_time_to_live = 0
>>>     AND gc_grace_seconds = 864000
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.1
>>>     AND speculative_retry = '99.0PERCENTILE';
>>>
>>> and the query is select * from tbl where id1=? and id2=? and id3=? and
>>> id4=?
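>>>
>>> For context, the app issues it roughly like the below, at LOCAL_QUORUM
>>> (a sketch with illustrative bind values, not our exact code):
>>>
>>>     import com.datastax.driver.core.*;
>>>
>>>     // Sketch of how the app side runs the lookup at LOCAL_QUORUM.
>>>     Session session = ...  // existing session
>>>     PreparedStatement ps = session.prepare(
>>>         "SELECT * FROM db.tbl WHERE id1=? AND id2=? AND id3=? AND id4=?");
>>>     BoundStatement bound = ps.bind("a", "b", "c", "d");  // illustrative values
>>>     bound.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
>>>     ResultSet rs = session.execute(bound);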
>>>
>>> The timeouts happen within ~2s to ~5s, while the successful calls have an
>>> avg of 8ms and a p99 of 15s. These times are seen from the app side; the
>>> actual query times would be slightly lower.
>>>
>>> Is there a way to capture traces only when queries take longer than a
>>> specified duration? We can't enable tracing in production given the
>>> volume of traffic. We see that the same query which timed out works fine
>>> later, so I am not sure if the trace of a successful run would help.
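>>>
>>> The closest workaround I can think of at the app level would be to enable
>>> tracing on a small sample of statements and only log the trace when the
>>> call is slow. A sketch, reusing the session and imports from the snippet
>>> above (not something we have tried; threshold, bind values and logging are
>>> illustrative):
>>>
>>>     // Sketch: trace a sampled statement client-side and keep the trace
>>>     // only when the call exceeds a threshold.
>>>     long thresholdMs = 1000;                   // illustrative threshold
>>>     Statement stmt = new SimpleStatement(
>>>         "SELECT * FROM db.tbl WHERE id1=? AND id2=? AND id3=? AND id4=?",
>>>         "a", "b", "c", "d").enableTracing();   // illustrative values
>>>     long start = System.nanoTime();
>>>     ResultSet rs = session.execute(stmt);
>>>     long elapsedMs = (System.nanoTime() - start) / 1_000_000;
>>>     if (elapsedMs > thresholdMs) {
>>>         QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
>>>         for (QueryTrace.Event event : trace.getEvents()) {
>>>             System.out.println(event.getSourceElapsedMicros() + "us  "
>>>                 + event.getDescription());
>>>         }
>>>     }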
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pmcfadin@gmail.com>
>>> wrote:
>>>
>>> If you are getting a timeout on one table, then a mismatch of RF and
>>> node count doesn't seem as likely.
>>>
>>> Time to look at your query. You said it was a 'select * from table where
>>> key=?' type query. I would next use the trace facility in cqlsh to
>>> investigate further. That's a good way to find hard-to-find issues. You
>>> should be looking for a clear ledge where you go from single-digit ms to
>>> 4- or 5-digit ms times.
>>>
>>> The other place to look is your data model for that table if you want to
>>> post the output from a desc table.
>>>
>>> Patrick
>>>
>>>
>>>
>>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <jaalex.tech@gmail.com>
>>> wrote:
>>>
>>> On further analysis, this issue happens only on 1 table in the KS, the one
>>> which has the most reads.
>>>
>>> @Atul, I will look at system health, but didn't see anything standing out
>>> from the GC logs (using JDK 1.8_92 with G1GC).
>>>
>>> @Patrick, could you please elaborate on the "mismatch on node count + RF"
>>> part?
>>>
>>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <atul.saroha@snapdeal.com>
>>> wrote:
>>>
>>> There could be many reasons for this if it is intermittent: CPU usage and
>>> I/O wait status, for example. As reads are I/O intensive, your IOPS
>>> requirement should be met under that load. It could be a heap issue if the
>>> CPU is busy with GC only, and network health could also be the reason. So
>>> it is better to look at system health during the time when it happens.
>>>
>>> ---------------------------------------------------------------
>>> Atul Saroha
>>> *Lead Software Engineer*
>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>
>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <jaalex.tech@gmail.com>
>>> wrote:
>>>
>>> Hi Patrick,
>>>
>>> The nodetool status shows all nodes up and normal now. From the OpsCenter
>>> "Event Log", there are some nodes reported as being down/up etc. during
>>> the timeframe of the timeouts, but these are Search workload nodes from the
>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pmcfadin@gmail.com>
>>> wrote:
>>>
>>> You aren't achieving quorum on your reads, as the error explains. That
>>> means you either have some nodes down or your topology is not matching up.
>>> The fact that you are using LOCAL_QUORUM might point to a datacenter
>>> mismatch on node count + RF.
>>>
>>> What does your nodetool status look like?
>>>
>>> Patrick
>>>
>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <jaalex.tech@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> We recently started getting intermittent timeouts on primary key queries
>>> (select * from table where key=<key>)
>>>
>>> The error is:
>>> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
>>> timeout during read query at consistency LOCAL_QUORUM (2 responses were
>>> required but only 1 replica responded)
>>>
>>> The same query would work fine when tried directly from cqlsh. There are
>>> no indications in system.log for the table in question, though there were
>>> compactions in progress for tables in another keyspace which is more
>>> frequently accessed.
>>>
>>> My understanding is that the chance of primary key queries timing out
>>> is very small. Please share possible reasons / ways to debug this
>>> issue.
>>>
>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>> --
>>> *Regards,*
>>> *Anshu *
>>>
>>
>
