Subject: Re: CQL 'IN' predicate
From: Aaron Morton
Date: Thu, 7 Nov 2013 17:26:15 +1300
To: user@cassandra.apache.org

> If one big query doesn't cause problems

Every row you read becomes roughly RF tasks in the cluster. If you ask for 100 rows in one query (at RF 3) it will generate 300 tasks that are processed by the read thread pool, which has a default of 32 threads. If you ask for a lot of rows and the number of nodes is low, there is a chance that one client will starve the others while they wait for all of its tasks to complete. So I tend to prefer asking for fewer rows per query.
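
As a rough illustration of the "fewer rows per query" approach, here is a minimal sketch assuming the DataStax Python driver; the keyspace, table and column names, the group size of 100 and the concurrency cap are placeholders rather than anything from the original question.

from cassandra.cluster import Cluster

GROUP_SIZE = 100       # rows requested per query
MAX_IN_FLIGHT = 4      # queries outstanding at any one time

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')    # hypothetical keyspace

def fetch_rows(keys):
    """Look up keys in small groups, keeping only a few queries in flight."""
    results, futures = [], []
    for i in range(0, len(keys), GROUP_SIZE):
        group = keys[i:i + GROUP_SIZE]
        markers = ', '.join(['%s'] * len(group))
        query = 'SELECT * FROM my_table WHERE id IN (%s)' % markers
        futures.append(session.execute_async(query, group))
        if len(futures) >= MAX_IN_FLIGHT:
            # block on the oldest request before issuing another one
            results.extend(futures.pop(0).result())
    for f in futures:
        results.extend(f.result())
    return results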

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On 7/11/2013, at 12:19 pm, Dan Gould <dan@chill.com> wrote:

Thanks Nate,

I assume 10k is the return limit.  I don't think I'll ever get close to 10k matches to the IN query.  That said, you're right: to be safe I'll increase the limit to match the number of items in the IN clause.

I didn't know CQL supported prepared statements, but I'll take a look.  I suppose my question was really about parsing overhead, though.  If one big query doesn't cause problems--which I assume it wouldn't, since there can be multiple threads parsing and I assume C* is smart about memory when accumulating results--I'd much rather do that.

Dan

On 11/6/13 3:05 PM, Nate McCall wrote:
Unless you explicitly set a page size (I'm pretty sure the query is converted to a paging query automatically under the hood), you will get capped at the default of 10k, which might get a little weird semantically. That said, you should experiment with explicit page sizes and see where that gets you (I've not tried this yet with an IN clause - I would be really curious to hear how it works). 
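
For what it's worth, with a cluster and driver new enough for native-protocol paging (Cassandra 2.0+), the page size can be set per statement. A hedged sketch with the DataStax Python driver, using placeholder names; as noted above, this has not been verified against an IN clause:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

def fetch_paged(ids, page_size=500):
    # explicit page size: rows come back in pages of `page_size`
    # instead of one big (or capped) result
    markers = ', '.join(['%s'] * len(ids))
    stmt = SimpleStatement(
        'SELECT * FROM my_table WHERE id IN (%s)' % markers,
        fetch_size=page_size)
    # iterating the result fetches the next page transparently as needed
    return [row for row in session.execute(stmt, ids)]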

Another thing to consider is that it's a pretty big statement to parse every time. You might want to go the (much) smaller batch route so these can be prepared statements (another thing I haven't tried with an IN clause - I don't see why it would not work, though).
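
A hedged sketch of that idea with the DataStax Python driver (placeholder names again): fix the group size so a single statement can be prepared once and reused, which avoids re-parsing the query text on every call.

from cassandra.cluster import Cluster

GROUP_SIZE = 100

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

# one statement with a fixed number of bind markers, prepared a single time
prepared = session.prepare(
    'SELECT * FROM my_table WHERE id IN (%s)'
    % ', '.join(['?'] * GROUP_SIZE))

def fetch_all(keys):
    rows = []
    full = (len(keys) // GROUP_SIZE) * GROUP_SIZE
    for i in range(0, full, GROUP_SIZE):
        rows.extend(session.execute(prepared, keys[i:i + GROUP_SIZE]))
    rest = keys[full:]
    if rest:
        # the leftover group is smaller than GROUP_SIZE, so it cannot reuse
        # the prepared statement; fall back to a one-off simple statement
        markers = ', '.join(['%s'] * len(rest))
        rows.extend(session.execute(
            'SELECT * FROM my_table WHERE id IN (%s)' % markers, rest))
    return rows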




On Wed, Nov 6, 2013 at 4:08 PM, Dan Gould <dan@chill.com> wrote:
I was wondering if anyone had a sense of performance/best practices
around the 'IN' predicate.

I have a list of up to ~30k keys that I want to look up in a
table (typical queries will have fewer than 500 keys, but I worry about the long tail).  Most
of them will not exist in the table, but, say, about 10-20% will.

Would it be best to do:

1) SELECT fields FROM table WHERE id in (uuid1, uuid2, ...... uuid30000);

2) Split into smaller batches--
for group_of_100 in all_30000:
   // ** Issue in parallel or block after each one??
   SELECT fields FROM table WHERE id in (group_of_100 uuids);

3) Something else?
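
(One possible "something else", sketched here purely as an illustration: Cassandra 2.0 lets a prepared statement bind a whole list to a single IN marker, so the key list never has to be spliced into the query text. DataStax Python driver assumed, placeholder names as before.)

from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

# "IN ?" takes the whole list as one bound value (Cassandra 2.0+)
prepared = session.prepare('SELECT * FROM my_table WHERE id IN ?')

def fetch(keys):
    # the single bound parameter is the list itself
    return list(session.execute(prepared, [keys]))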

My guess is that (1) is fine and that the only worry is too much data returned (which won't be a problem in this case), but I wanted to check that it's not a C* anti-pattern first.

[Conversely, is a batch insert with up to 30k items ok?]
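
(On the insert side, a 30k-statement batch is generally better split into smaller chunks as well. A hedged sketch assuming a driver and cluster new enough for the DataStax Python driver's BatchStatement, with placeholder names and an arbitrary chunk size of 100:)

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

CHUNK = 100

cluster = Cluster(['127.0.0.1'])
session = cluster.connect('my_keyspace')      # hypothetical keyspace

insert = session.prepare('INSERT INTO my_table (id, value) VALUES (?, ?)')

def insert_rows(rows):            # rows: iterable of (id, value) tuples
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    pending = 0
    for row in rows:
        batch.add(insert, row)
        pending += 1
        if pending == CHUNK:      # flush every CHUNK statements
            session.execute(batch)
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            pending = 0
    if pending:
        session.execute(batch)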

Thanks,
Dan




--
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

