cassandra-user mailing list archives

From Ian Rose <ianr...@fullstory.com>
Subject Re: range query times out (on 1 node, just 1 row in table)
Date Wed, 13 Aug 2014 14:33:54 GMT
Frankly, no matter how inefficient / expensive the query is, surely it
should still work when there is only 1 row and 1 node (which is localhost)!

I'm starting to wonder if range queries on secondary indexes aren't
supported at all (although if that is the case, I would certainly prefer an
error rather than a timeout!).  I've been scouring the web trying to find a
definitive answer on this but all I have come up with is this (old,
non-authoritative) blog post which states "Cassandra’s native index is
like a hashed index, which means you can only do equality query and not
range query."

http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
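
For what it's worth, here is a toy Python sketch of what that claim would
imply if it were accurate. It is only an illustration, not Cassandra's actual
index implementation: ToyHashIndex and its methods are hypothetical names made
up for this example, and the single entry mirrors the one row in my test table.

```python
from typing import Dict, List, Tuple

# Toy model only (assumption): a hash-style index maps each indexed value
# to the primary keys of the rows holding that value.
PrimaryKey = Tuple[str, int]  # (foo_name, foo_shard)


class ToyHashIndex:
    def __init__(self) -> None:
        self._entries: Dict[int, List[PrimaryKey]] = {}

    def add(self, value: int, key: PrimaryKey) -> None:
        self._entries.setdefault(value, []).append(key)

    def lookup_eq(self, value: int) -> List[PrimaryKey]:
        # Equality: a single direct lookup by the indexed value.
        return list(self._entries.get(value, []))

    def lookup_gt(self, bound: int) -> List[PrimaryKey]:
        # Range: entries are not ordered by value, so every entry is visited.
        matches: List[PrimaryKey] = []
        for value, keys in self._entries.items():
            if value > bound:
                matches.extend(keys)
        return matches


index = ToyHashIndex()
index.add(100, ("dave", 27))   # the one row: int_val = 100
print(index.lookup_eq(100))    # equality on int_val
print(index.lookup_gt(0))      # range on int_val, full scan of the index
```

Under this model a range predicate degrades to a full scan of the index, but a
full scan over a single entry is still trivial, so this alone would not
explain a timeout.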




On Wed, Aug 13, 2014 at 10:27 AM, DuyHai Doan <doanduyhai@gmail.com> wrote:

> It does not matter whether this table has one row or n rows. Before fetching
> data in the table foo, C* must determine:
>
> 1) how many primary keys of table "foo" match the condition foo_name='dave'
> --> read from the 2nd index "foo_name" where partition key = "dave"
> 2) how many primary keys of table "foo" match the condition int_val>0 --> read
> from the 2nd index "int_val" where partition key > 0, so basically it is a
> range scan
>
> Once it gets all the results from 2nd indices, C* can query the primary
> table to return data.
>
>  I've read somewhere that when having multiple conditions in the WHERE
> clause, C* should use the most restrictive condition to optimize
> performance. In our example, equality condition on "foo_name" seems to be
> the most restrictive.
>
>  My assumption is that C* uses statistics to determine the most
> restrictive condition, and since here we have only 1 row of data, the
> statistics are useless, so it ends up doing a range scan on int_val ....
>
>  It would be nice if someone could confirm or refute this assumption. The
> last time I looked into the source code of the 2nd index was more than 6
> months ago, so things may have changed since then.
>
>
>
>
> On Wed, Aug 13, 2014 at 3:29 PM, Jack Krupansky <jack@basetechnology.com>
> wrote:
>
>>   Agreed, but... in this case the table has ONE row, so what exactly
>> could be causing this timeout? I mean, it can’t be the row count, right?
>>
>> -- Jack Krupansky
>>
>>  *From:* DuyHai Doan <doanduyhai@gmail.com>
>> *Sent:* Wednesday, August 13, 2014 9:01 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: range query times out (on 1 node, just 1 row in table)
>>
>>  Hello Ian
>>
>> Secondary indexes perform poorly with inequalities (<, ≤, >, ≥). Indeed,
>> an inequality forces the server to scan the whole cluster to find the
>> requested range, which is clearly not optimal. That's the reason why you
>> need to add "ALLOW FILTERING" for the query to be accepted.
>>
>> "ALLOW FILTERING" means "beware of what you're doing, we C* developers do
>> not give any guarantee about the performance of such a query".
>>
>> As Robert Coli used to say on this list, ALLOW FILTERING is a synonym for
>> PROBABLY TIMEOUT :D
>>
>>
>> On Wed, Aug 13, 2014 at 2:56 PM, Ian Rose <ianrose@fullstory.com> wrote:
>>
>>> Confusingly, it appears to be the presence of an index on int_val that
>>> is causing this timeout.  If I drop that index (leaving only the index on
>>> foo_name) the query works just fine.
>>>
>>>
>>> On Tue, Aug 12, 2014 at 10:25 PM, Ian Rose <ianrose@fullstory.com>
>>> wrote:
>>>
>>>> Hi -
>>>>
>>>> I am currently running a single Cassandra node on my local dev
>>>> machine.  Here is my (test) schema (which is meaningless, I created it just
>>>> to demonstrate the issue I am running into):
>>>>
>>>>  CREATE TABLE foo (
>>>>   foo_name ascii,
>>>>   foo_shard bigint,
>>>>   int_val bigint,
>>>>   PRIMARY KEY ((foo_name, foo_shard))
>>>> ) WITH read_repair_chance=0.1;
>>>>
>>>> CREATE INDEX ON foo (int_val);
>>>> CREATE INDEX ON foo (foo_name);
>>>>
>>>> I have inserted just a single row into this table:
>>>> insert into foo(foo_name, foo_shard, int_val) values('dave', 27, 100);
>>>>
>>>> This query works fine:
>>>> select * from foo where foo_name='dave';
>>>>
>>>> But when I run this query, I get an RPC timeout:
>>>> select * from foo where foo_name='dave' and int_val > 0 allow filtering;
>>>>
>>>> With tracing enabled, here is the trace output:
>>>> http://pastebin.com/raw.php?i=6XMEVUcQ
>>>>
>>>> (In short, everything looks fine to my untrained eye until 10s elapsed,
>>>> at which time the following event is logged: "Timed out; received 0 of 1
>>>> responses for range 257 of 257")
>>>>
>>>> Can anyone help interpret this error?
>>>>
>>>> Many thanks!
>>>> Ian
>>>>
>>>>
>>>
>>>
>>
>>
>
>
