From user-return-37440-apmail-cassandra-user-archive=cassandra.apache.org@cassandra.apache.org Wed Nov 6 23:20:19 2013 Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 047CF10DF5 for ; Wed, 6 Nov 2013 23:20:19 +0000 (UTC) Received: (qmail 20419 invoked by uid 500); 6 Nov 2013 23:20:16 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 20374 invoked by uid 500); 6 Nov 2013 23:20:16 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 20366 invoked by uid 99); 6 Nov 2013 23:20:16 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Nov 2013 23:20:16 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of dan@chill.com designates 209.85.160.49 as permitted sender) Received: from [209.85.160.49] (HELO mail-pb0-f49.google.com) (209.85.160.49) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 06 Nov 2013 23:20:09 +0000 Received: by mail-pb0-f49.google.com with SMTP id um15so203866pbc.22 for ; Wed, 06 Nov 2013 15:19:47 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:message-id:date:from:user-agent:mime-version:to :subject:references:in-reply-to:content-type; bh=A/uz3otx3emBN+InM96lPBa2dB+j1HXRCLtwdnDYk6M=; b=kjXY3nMCEwVeKwkg7bjlBJ29r+A/Z0Z2Fc88IPDbz474D3y533g0zTpHR8EM3IWs2e zhIk+x/kJPNvhRPujhAFq8y4H2QSnwR0swz0dS9iHAGPSGsHkd3UkpDVKE+gkknzWLhR 5ZIqLGtNc0ZyAQLHc6Mniwek8i1/MU9TUzmB4H7PKR/y5BlZnQirgcEUmM7r+/vSMkBV SuX0YbQG/hvHZ7DiZPpLGivHvLnW1ob1KA3xlVuW1gT01LLKi8NjJWejN5G0BLuNDKFd e7zxR/FGPuz1Qcmc7SMLHoOE+GxC3NnaXGYXoACNBTcioOJYLuyTEzoY1NICHy4+4tXU qO5A== X-Gm-Message-State: ALoCoQm0WNDBXxk1v0DbUCPThka7U4CXc03aKdG6qz19K+rSJU1YvaOq6c6LSJqo5rjE9CC2zcM+ X-Received: by 10.66.146.199 with SMTP id te7mr6550312pab.106.1383779987776; Wed, 06 Nov 2013 15:19:47 -0800 (PST) Received: from Daniels-MacBook-Pro-2.local (adsl-69-231-206-116.dsl.irvnca.pacbell.net. [69.231.206.116]) by mx.google.com with ESMTPSA id ye1sm1259482pab.19.2013.11.06.15.19.46 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 06 Nov 2013 15:19:47 -0800 (PST) Message-ID: <527ACE94.9030402@chill.com> Date: Wed, 06 Nov 2013 15:19:48 -0800 From: Dan Gould User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Thunderbird/24.1.0 MIME-Version: 1.0 To: user@cassandra.apache.org Subject: Re: CQL 'IN' predicate References: <527ABDFA.40607@chill.com> In-Reply-To: Content-Type: multipart/alternative; boundary="------------020906020501050301070709" X-Virus-Checked: Checked by ClamAV on apache.org This is a multi-part message in MIME format. --------------020906020501050301070709 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Thanks Nate, I assume 10k is the return limit. I don't think I'll ever get close to 10k matches to the IN query. That said, you're right: to be safe I'll increase the limit to match the number of items on the IN. I didn't know CQL supported stored procedures, but I'll take a look. I suppose my question was asking about parsing overhead, however. If one big query doesn't cause problems--which I assume it wouldn't since there can be multiple threads parsing and I assume C* is smart about memory when accumulating results--I'd much rather do that. Dan On 11/6/13 3:05 PM, Nate McCall wrote: > Unless you explicitly set a page size (i'm pretty sure the query is > converted to a paging query automatically under the hood) you will get > capped at the default of 10k which might get a little weird > semantically. That said, you should experiment with explicit page > sizes and see where it gets you (i've not tried this yet with an IN > clause - would be real curious to hear how it worked). > > Another thing to consider is that it's a pretty big statement to parse > every time. You might want to go the (much) smaller batch route so > these can be stored procedures? (another thing I havent tried with IN > clause - don't see why it would not work though). > > > > > On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould > wrote: > > I was wondering if anyone had a sense of performance/best practices > around the 'IN' predicate. > > I have a list of up to potentially ~30k keys that I want to look > up in a > table (typically queries will have <500, but I worry about the > long tail). Most > of them will not exist in the table, but, say, about 10-20% will. > > Would it be best to do: > > 1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... > uuid30000); > > 2) Split into smaller batches-- > for group_of_100 in all_30000: > // ** Issue in parallel or block after each one?? > SELECT fields FROM table WHERE id in (group_of_100 uuids); > > 3) Something else? > > My guess is that (1) is fine and that the only worry is too much > data returned (which won't be a problem in this case), but I > wanted to check that it's not a C* anti-pattern before. > > [Conversely, is a batch insert with up to 30k items ok?] > > Thanks, > Dan > > > > > -- > ----------------- > Nate McCall > Austin, TX > @zznate > > Co-Founder & Sr. Technical Consultant > Apache Cassandra Consulting > http://www.thelastpickle.com --------------020906020501050301070709 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Thanks Nate,

I assume 10k is the return limit.  I don't think I'll ever get close to 10k matches to the IN query.  That said, you're right: to be safe I'll increase the limit to match the number of items on the IN.

I didn't know CQL supported stored procedures, but I'll take a look.  I suppose my question was asking about parsing overhead, however.  If one big query doesn't cause problems--which I assume it wouldn't since there can be multiple threads parsing and I assume C* is smart about memory when accumulating results--I'd much rather do that.

Dan

On 11/6/13 3:05 PM, Nate McCall wrote:
Unless you explicitly set a page size (i'm pretty sure the query is converted to a paging query automatically under the hood) you will get capped at the default of 10k which might get a little weird semantically. That said, you should experiment with explicit page sizes and see where it gets you (i've not tried this yet with an IN clause - would be real curious to hear how it worked). 

Another thing to consider is that it's a pretty big statement to parse every time. You might want to go the (much) smaller batch route so these can be stored procedures? (another thing I havent tried with IN clause - don't see why it would not work though).




On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <dan@chill.com> wrote:
I was wondering if anyone had a sense of performance/best practices
around the 'IN' predicate.

I have a list of up to potentially ~30k keys that I want to look up in a
table (typically queries will have <500, but I worry about the long tail).  Most
of them will not exist in the table, but, say, about 10-20% will.

Would it be best to do:

1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);

2) Split into smaller batches--
for group_of_100 in all_30000:
   // ** Issue in parallel or block after each one??
   SELECT fields FROM table WHERE id in (group_of_100 uuids);

3) Something else?

My guess is that (1) is fine and that the only worry is too much data returned (which won't be a problem in this case), but I wanted to check that it's not a C* anti-pattern before.

[Conversely, is a batch insert with up to 30k items ok?]

Thanks,
Dan




--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

--------------020906020501050301070709--