cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Aaron Morton <>
Subject Re: CQL 'IN' predicate
Date Thu, 07 Nov 2013 04:26:15 GMT
> If one big query doesn't cause problems

Every row you read becomes a (roughly) RF number of tasks in the cluster. If you ask for 100
rows in one query it will generate 300 tasks that are processed by the read thread pool which
as a default of 32 threads. If you ask for a lot of rows and the number of nodes in low there
is a chance the client starve others as they wait for all the tasks to be completed. So i
tend to like asking for fewer rows. 


Aaron Morton
New Zealand

Co-Founder & Principal Consultant
Apache Cassandra Consulting

On 7/11/2013, at 12:19 pm, Dan Gould <> wrote:

> Thanks Nate,
> I assume 10k is the return limit.  I don't think I'll ever get close to 10k matches to
the IN query.  That said, you're right: to be safe I'll increase the limit to match the number
of items on the IN.
> I didn't know CQL supported stored procedures, but I'll take a look.  I suppose my question
was asking about parsing overhead, however.  If one big query doesn't cause problems--which
I assume it wouldn't since there can be multiple threads parsing and I assume C* is smart
about memory when accumulating results--I'd much rather do that.
> Dan
> On 11/6/13 3:05 PM, Nate McCall wrote:
>> Unless you explicitly set a page size (i'm pretty sure the query is converted to
a paging query automatically under the hood) you will get capped at the default of 10k which
might get a little weird semantically. That said, you should experiment with explicit page
sizes and see where it gets you (i've not tried this yet with an IN clause - would be real
curious to hear how it worked). 
>> Another thing to consider is that it's a pretty big statement to parse every time.
You might want to go the (much) smaller batch route so these can be stored procedures? (another
thing I havent tried with IN clause - don't see why it would not work though).
>> On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <> wrote:
>> I was wondering if anyone had a sense of performance/best practices
>> around the 'IN' predicate.
>> I have a list of up to potentially ~30k keys that I want to look up in a
>> table (typically queries will have <500, but I worry about the long tail).  Most
>> of them will not exist in the table, but, say, about 10-20% will.
>> Would it be best to do:
>> 1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);
>> 2) Split into smaller batches--
>> for group_of_100 in all_30000:
>>    // ** Issue in parallel or block after each one??
>>    SELECT fields FROM table WHERE id in (group_of_100 uuids);
>> 3) Something else?
>> My guess is that (1) is fine and that the only worry is too much data returned (which
won't be a problem in this case), but I wanted to check that it's not a C* anti-pattern before.
>> [Conversely, is a batch insert with up to 30k items ok?]
>> Thanks,
>> Dan
>> -- 
>> -----------------
>> Nate McCall
>> Austin, TX
>> @zznate
>> Co-Founder & Sr. Technical Consultant
>> Apache Cassandra Consulting

View raw message