From: Duncan Sands <duncan.sands@gmail.com>
To: user@cassandra.apache.org
Date: Sat, 02 Aug 2014 13:04:58 +0200
Subject: Re: Occasional read timeouts seen during row scans
Message-ID: <53DCC5DA.8060402@gmail.com>

Hi Clint, is time correctly synchronized between your nodes?

Ciao, Duncan.

On 02/08/14 02:12, Clint Kelly wrote:
> BTW, a few other details, sorry for omitting these:
>
> * We are using version 2.0.4 of the Java driver
> * We are running against Cassandra 2.0.9
> * I tried messing around with the page size (even reducing it down to a
>   single record) and that didn't seem to help in the cases where I was
>   observing the timeout
>
> Best regards,
> Clint
>
> On Fri, Aug 1, 2014 at 5:02 PM, Clint Kelly wrote:
>
> Hi everyone,
>
> I am seeing occasional read timeouts during multi-row queries, but I'm
> having difficulty reproducing them or understanding what the problem is.
>
> First, some background:
>
> Our team wrote a custom MapReduce InputFormat that looks pretty similar
> to the DataStax InputFormat, except that it allows queries that touch
> multiple CQL tables with the same PRIMARY KEY format (it then assembles
> results from multiple tables for the same primary key before sending
> them back to the user in the RecordReader).
>
> During a large batch job in a cluster and during some integration
> tests, we see errors like the following:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency ONE (1 responses were
> required but only 0 replica responded)
>
> Our queries look like this:
>
> SELECT token(eid_component), eid_component, lg, family, qualifier,
> version, value FROM "kiji_it0"."t_foo" WHERE lg=? AND family=? AND
> qualifier=? AND token(eid_component) >= ? AND token(eid_component) <= ?
> ALLOW FILTERING;
>
> Our tables look like the following:
>
> CREATE TABLE "kiji_it0"."t_foo" (
>   eid_component varchar,
>   lg varchar,
>   family blob,
>   qualifier blob,
>   version bigint,
>   value blob,
>   PRIMARY KEY ((eid_component), lg, family, qualifier, version)
> ) WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);
>
> with an additional index on the "lg" column (the lg column is
> *extremely* low cardinality).
>
> (FWIW I realize that "ALLOW FILTERING" is potentially a Very Bad Idea,
> but we are building a framework on top of Cassandra and MapReduce that
> allows our users to occasionally make queries like this. We don't
> really mind taking a performance hit since these are batch jobs. We
> are considering eventually supporting some automatic denormalization,
> but have not done so yet.)
>
> If I change the query above to remove the WHERE clauses, the errors go
> away.
>
> I think I understand the problem here---there are some rows with huge
> amounts of data that we have to scan over, and occasionally those scans
> take so long that there is a timeout.
>
> I have a couple of questions:
>
> 1. What parameters in my code or in the Cassandra cluster do I need to
>    adjust to get rid of these timeouts? Our table layout is designed
>    so that its real-time performance should be pretty good, so I don't
>    mind if the batch queries are a little bit slow.
>    Do I need to change the read_request_timeout_in_ms parameter? Or
>    something else?
>
> 2. I have tried to create a test to reproduce this problem, but I have
>    been unable to do so. Any suggestions on how to do this? I tried
>    creating a table similar to that described above and filling in a
>    huge amount of data for some rows to try to increase the amount of
>    space we'd need to skip over. I also tried reducing
>    read_request_timeout_in_ms from 5000 ms to 50 ms and still no dice.
>
> Let me know if anyone has any thoughts or suggestions. At a minimum I'd
> like to be able to reproduce these read timeout errors in some
> integration tests.
>
> Thanks!
>
> Best regards,
> Clint
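
[Editorial note on question 1, hedged: in Cassandra 2.0 the token-range scans issued by queries like the one above are governed by a separate server-side timeout, range_request_timeout_in_ms, so raising only read_request_timeout_in_ms may have no effect on them. A sketch of the relevant cassandra.yaml settings; the values shown are believed to be the stock 2.0 defaults:]

```yaml
# cassandra.yaml -- server-side read timeouts (believed 2.0 defaults)
read_request_timeout_in_ms: 5000     # single-partition reads
range_request_timeout_in_ms: 10000   # token-range scans, like the query above
```

[The DataStax Java driver also enforces its own client-side socket read timeout (SocketOptions.setReadTimeoutMillis, default 12000 ms in the 2.0.x driver). Since the coordinator is the one returning a ReadTimeoutException here, the server-side limit appears to be the one being hit, but a client-side timeout that is shorter than the server-side one can mask which side timed out.]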
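
[Editorial note on question 2, a reproduction sketch, untested; the "repro" keyspace and "wide" table names are hypothetical. The idea is to build a single very wide partition whose clustering rows mostly do NOT match the filtered lg value, so that ALLOW FILTERING must step over a lot of data before it can fill even a small page:]

```sql
-- Hypothetical repro schema mirroring t_foo's layout (Cassandra 2.0 CQL)
CREATE KEYSPACE IF NOT EXISTS repro
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE repro.wide (
  eid_component varchar,
  lg            varchar,
  family        blob,
  qualifier     blob,
  version       bigint,
  value         blob,
  PRIMARY KEY ((eid_component), lg, family, qualifier, version)
) WITH CLUSTERING ORDER BY (lg ASC, family ASC, qualifier ASC, version DESC);

CREATE INDEX ON repro.wide (lg);
```

[Then load one partition (a single eid_component) with a large number of clustering rows under an lg value that the test query filters out, using blob values of a few KB each, and run the filtered token-range SELECT with the timeouts lowered. The filtering work should scale with the data skipped rather than the data returned, which would explain why shrinking the page size in the original experiment did not help.]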