incubator-cassandra-user mailing list archives

From Clint Kelly <clint.ke...@gmail.com>
Subject Re: Occasional read timeouts seen during row scans
Date Mon, 04 Aug 2014 16:17:34 GMT
Hi all,

1. I saw this issue in an integration test with a single CassandraDaemon
running, so I don't think it was a time synchronization issue.

2. I did not look in the log for garbage collection issues, but I was able
to reproduce this 100% deterministically, so I think it was an issue having
to do with the organization and size of my data.  I have been unable to fix
this by retrying failed reads (because this behavior, when it occurs, is
deterministic).

I was looking for guidance on how to tune Cassandra's timeout thresholds in
both directions: raising the threshold in the cluster so that these scans can
complete, and lowering it in unit or integration tests so that I can reproduce
the failure reliably.
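For anyone else digging into this: if I'm reading the cassandra.yaml
documentation correctly, a token(...) range scan like ours is governed by
range_request_timeout_in_ms rather than read_request_timeout_in_ms, which
would explain why lowering only the latter in my tests had no effect. The
relevant knobs (shown with what I believe are the Cassandra 2.0 defaults;
treat the values as illustrative, not recommendations):

```yaml
# cassandra.yaml (Cassandra 2.0 defaults shown)
read_request_timeout_in_ms: 5000     # single-partition point reads
range_request_timeout_in_ms: 10000   # range scans, e.g. token(eid_component) queries
request_timeout_in_ms: 10000         # catch-all for other operations
```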

Also, if anyone has ideas about how my particular table layout might lead to
these kinds of problems, that would be great.  Thanks!
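In case it helps anyone else, here is a minimal sketch of the client-side
settings I believe are involved (DataStax Java driver 2.0.x; the contact
point, timeout, and fetch size below are placeholder values, not
recommendations). The driver's socket-level read timeout has to exceed the
server-side timeouts, or the client gives up before the server ever reports
the failure:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.SocketOptions;

public class TimeoutTuningSketch {
    public static void main(String[] args) {
        // Driver-side socket read timeout: should be larger than the
        // server-side *_request_timeout_in_ms values (driver default: 12000 ms).
        SocketOptions socketOptions = new SocketOptions()
                .setReadTimeoutMillis(65000);

        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")    // placeholder contact point
                .withSocketOptions(socketOptions)
                .build();

        // Smaller pages mean each round trip scans less data before a
        // per-request timeout can fire -- useful when filtering over wide rows.
        SimpleStatement scan = new SimpleStatement(
                "SELECT token(eid_component), eid_component, lg, family, "
                + "qualifier, version, value FROM \"kiji_it0\".\"t_foo\" "
                + "WHERE lg=? AND family=? AND qualifier=? "
                + "AND token(eid_component) >= ? AND token(eid_component) <= ? "
                + "ALLOW FILTERING");
        scan.setFetchSize(100);

        cluster.close();
    }
}
```

(In my case reducing the page size did not make the error go away, but the
timeout and fetch-size settings are still the first things worth checking.)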

Best regards,
Clint

On Sat, Aug 2, 2014 at 4:40 AM, Jack Krupansky <jack@basetechnology.com>
wrote:

> Are you seeing garbage collections in the log at around the same time as
> these occasional timeouts?
>
> Can you identify which requests are timing out? And then can you try some
> of them again and see if they succeed at least sometimes and how long they
> take then?
>
> Do you have a test case that you believe exercises the worst case for
> filtering? How long does it take?
>
> Can you monitor if the timed-out node is compute bound or I/O bound at the
> times of failure? Do you see spikes for compute or I/O?
>
> Can your app simply retry the timed-out request? Does even a retry
> typically fail, or does retry get you to 100% success? I would note that
> even the best distributed systems do not guarantee zero failures for
> environmental issues, so apps need to tolerate occasional failures.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Duncan Sands
> Sent: Saturday, August 2, 2014 7:04 AM
> To: user@cassandra.apache.org
> Subject: Re: Occasional read timeouts seen during row scans
>
>
> Hi Clint, is time correctly synchronized between your nodes?
>
> Ciao, Duncan.
>
> On 02/08/14 02:12, Clint Kelly wrote:
>
>> BTW a few other details, sorry for omitting these:
>>
>>   * We are using version 2.0.4 of the Java driver
>>   * We are running against Cassandra 2.0.9
>>   * I tried messing around with the page size (even reducing it down to a
>> single
>>     record) and that didn't seem to help (in the cases where I was
>> observing the
>>     timeout)
>>
>> Best regards,
>> Clint
>>
>>
>>
>> On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly <clint.kelly@gmail.com
>> <mailto:clint.kelly@gmail.com>> wrote:
>>
>>     Hi everyone,
>>
>>     I am seeing occasional read timeouts during multi-row queries, but I'm
>>     having difficulty reproducing them or understanding what the problem
>>     is.
>>
>>     First, some background:
>>
>>     Our team wrote a custom MapReduce InputFormat that looks pretty
>>     similar to the DataStax InputFormat except that it allows queries that
>>     touch multiple CQL tables with the same PRIMARY KEY format (it then
>>     assembles together results from multiple tables for the same primary
>>     key before sending them back to the user in the RecordReader).
>>
>>     During a large batch job in a cluster and during some integration
>>     tests, we see errors like the following:
>>
>>     com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
>>     timeout during read query at consistency ONE (1 responses were
>>     required but only 0 replica responded)
>>
>>     Our queries look like this:
>>
>>     SELECT token(eid_component), eid_component, lg, family, qualifier,
>>     version, value FROM "kiji_it0"."t_foo" WHERE lg=? AND family=? AND
>>     qualifier=?  AND token(eid_component) >= ? AND token(eid_component) <=
>>     ? ALLOW FILTERING;
>>
>>     Our tables look like the following:
>>
>>     CREATE TABLE "kiji_it0"."t_foo" (
>>       eid_component varchar,
>>       lg varchar,
>>       family blob,
>>       qualifier blob,
>>       version bigint,
>>       value blob,
>>       PRIMARY KEY ((eid_component), lg, family, qualifier, version))
>>     WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version
>> DESC);
>>
>>     with an additional index on the "lg" column (the lg column is
>>     *extremely* low cardinality).
>>
>>     (FWIW I realize that having "ALLOW FILTERING" is potentially a Very
>>     Bad Idea, but we are building a framework on top of Cassandra and
>>     MapReduce that allows our users to occasionally make queries like
>>     this.  We don't really mind taking a performance hit since these are
>>     batch jobs.  We are considering eventually supporting some automatic
>>     denormalization, but have not done so yet.)
>>
>>     If I change the query above to remove the WHERE clauses, the errors
>> go away.
>>
>>     I think I understand the problem here---there are some rows that have
>>     huge amounts of data that we have to scan over, and occasionally those
>>     scans take so long that there is a timeout.
>>
>>     I have a couple of questions:
>>
>>     1. What parameters in my code or in the Cassandra cluster do I need to
>>     adjust to get rid of these timeouts?  Our table layout is designed
>>     such that its real-time performance should be pretty good, so I don't
>>     mind if the batch queries are a little bit slow.  Do I need to change
>>     the read_request_timeout_in_ms parameter?  Or something else?
>>
>>     2. I have tried to create a test to reproduce this problem, but I have
>>     been unable to do so.  Any suggestions on how to do this?  I tried
>>     creating a table similar to that described above and filling in a huge
>>     amount of data for some rows to try to increase the amount of space
>>     that we'd need to skip over.  I also tried reducing
>>     read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.
>>
>>     Let me know if anyone has any thoughts or suggestions.  At a minimum
>>     I'd like to be able to reproduce these read timeout errors in some
>>     integration tests.
>>
>>     Thanks!
>>
>>     Best regards,
>>     Clint
>>
>>
>>
>
