Subject: Re: CQL 'IN' predicate
From: Aaron Morton
Date: Thu, 7 Nov 2013 17:26:15 +1300
To: user@cassandra.apache.org

> If one big query doesn't cause problems

Every row you read becomes roughly RF tasks in the cluster. If you ask for 100 rows in one query (at RF 3) it will generate 300 tasks that are processed by the read thread pool, which has a default of 32 threads. If you ask for a lot of rows and the number of nodes is low, there is a chance that one client will starve the others while they wait for all of its tasks to complete. So I tend to prefer asking for fewer rows per query.
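
As a rough illustration of the "fewer rows per query" approach, here is a minimal sketch assuming the DataStax Python driver; the keyspace, table and column names, the group size of 100 and the concurrency cap are placeholders rather than anything from the original question.

from cassandra.cluster import Cluster

GROUP_SIZE = 100       # rows requested per query
MAX_IN_FLIGHT = 4      # queries outstanding at any one time

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')    # hypothetical keyspace

def fetch_rows(keys):
    """Look up keys in small groups, keeping only a few queries in flight."""
    results, futures = [], []
    for i in range(0, len(keys), GROUP_SIZE):
        group = keys[i:i + GROUP_SIZE]
        markers = ', '.join(['%s'] * len(group))
        query = 'SELECT * FROM my_table WHERE id IN (%s)' % markers
        futures.append(session.execute_async(query, group))
        if len(futures) >= MAX_IN_FLIGHT:
            # block on the oldest request before issuing another one
            results.extend(futures.pop(0).result())
    for f in futures:
        results.extend(f.result())
    return results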

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 7/11/2013, at 12:19 pm, Dan Gould <dan@chill.com> wrote:

Thanks Nate,

I assume 10k is the return limit.  I don't think I'll ever get close to 10k matches to the IN query.  That said, you're right: to be safe I'll increase the limit to match the number of items in the IN clause.

I didn't know CQL supported prepared statements, but I'll take a look.  I suppose my question was really about parsing overhead, though.  If one big query doesn't cause problems--which I assume it wouldn't, since there can be multiple threads parsing and I assume C* is smart about memory when accumulating results--I'd much rather do that.

Dan

On 11/6/13 3:05 PM, Nate McCall wrote:
Unless you explicitly set a page size (I'm pretty sure the query is converted to a paging query automatically under the hood), you will get capped at the default of 10k, which might get a little weird semantically. That said, you should experiment with explicit page sizes and see where that gets you (I've not tried this yet with an IN clause - I would be really curious to hear how it works). 
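
For what it's worth, with a cluster and driver new enough for native-protocol paging (Cassandra 2.0+), the page size can be set per statement. A hedged sketch with the DataStax Python driver, using placeholder names; as noted above, this has not been verified against an IN clause:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

def fetch_paged(ids, page_size=500):
    # explicit page size: rows come back in pages of `page_size`
    # instead of one big (or capped) result
    markers = ', '.join(['%s'] * len(ids))
    stmt = SimpleStatement(
        'SELECT * FROM my_table WHERE id IN (%s)' % markers,
        fetch_size=page_size)
    # iterating the result fetches the next page transparently as needed
    return [row for row in session.execute(stmt, ids)]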

Another thing to consider is that it's a pretty big statement to parse every time. You might want to go the (much) smaller batch route so these can be prepared statements (another thing I haven't tried with an IN clause - I don't see why it would not work, though).
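
A hedged sketch of that idea with the DataStax Python driver (placeholder names again): fix the group size so a single statement can be prepared once and reused, which avoids re-parsing the query text on every call.

from cassandra.cluster import Cluster

GROUP_SIZE = 100

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

# one statement with a fixed number of bind markers, prepared a single time
prepared = session.prepare(
    'SELECT * FROM my_table WHERE id IN (%s)'
    % ', '.join(['?'] * GROUP_SIZE))

def fetch_all(keys):
    rows = []
    full = (len(keys) // GROUP_SIZE) * GROUP_SIZE
    for i in range(0, full, GROUP_SIZE):
        rows.extend(session.execute(prepared, keys[i:i + GROUP_SIZE]))
    rest = keys[full:]
    if rest:
        # the leftover group is smaller than GROUP_SIZE, so it cannot reuse
        # the prepared statement; fall back to a one-off simple statement
        markers = ', '.join(['%s'] * len(rest))
        rows.extend(session.execute(
            'SELECT * FROM my_table WHERE id IN (%s)' % markers, rest))
    return rows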




On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <dan@chill.com> wrote:
I was wondering if anyone had a sense of performance/best practices
around the 'IN' predicate.

I have a list of up to ~30k keys that I want to look up in a
table (typical queries will have fewer than 500 keys, but I worry about the long tail).  Most
of them will not exist in the table, but, say, about 10-20% will.

Would it be best to do:

1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);

2) Split into smaller batches--
for group_of_100 in all_30000:
   // ** Issue in parallel or block after each one??
   SELECT fields FROM table WHERE id in (group_of_100 uuids);

3) Something else?
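
(One possible "something else", sketched here purely as an illustration: Cassandra 2.0 lets a prepared statement bind a whole list to a single IN marker, so the key list never has to be spliced into the query text. DataStax Python driver assumed, placeholder names as before.)

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

# "IN ?" takes the whole list as one bound value (Cassandra 2.0+)
prepared = session.prepare('SELECT * FROM my_table WHERE id IN ?')

def fetch(keys):
    # the single bound parameter is the list itself
    return list(session.execute(prepared, [keys]))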

My guess is that (1) is fine and that the only worry is too much data returned (which won't be a problem in this case), but I wanted to check that it's not a C* anti-pattern first.

[Conversely, is a batch insert with up to 30k items ok?]
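
(On the insert side, a 30k-statement batch is generally better split into smaller chunks as well. A hedged sketch assuming a driver and cluster new enough for the DataStax Python driver's BatchStatement, with placeholder names and an arbitrary chunk size of 100:)

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

CHUNK = 100

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

insert = session.prepare('INSERT INTO my_table (id, value) VALUES (?, ?)')

def insert_rows(rows):            # rows: iterable of (id, value) tuples
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    pending = 0
    for row in rows:
        batch.add(insert, row)
        pending += 1
        if pending == CHUNK:      # flush every CHUNK statements
            session.execute(batch)
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            pending = 0
    if pending:
        session.execute(batch)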

Thanks,
Dan




--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

