Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id E9D8111FF9 for ; Fri, 20 Jun 2014 11:00:10 +0000 (UTC) Received: (qmail 39402 invoked by uid 500); 20 Jun 2014 11:00:06 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 39364 invoked by uid 500); 20 Jun 2014 11:00:06 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 39349 invoked by uid 99); 20 Jun 2014 11:00:06 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 20 Jun 2014 11:00:06 +0000 X-ASF-Spam-Status: No, hits=-6.5 required=5.0 tests=ENV_AND_HDR_SPF_MATCH,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,SPF_PASS,USER_IN_DEF_SPF_WL X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of michael.laing@nytimes.com designates 209.85.216.182 as permitted sender) Received: from [209.85.216.182] (HELO mail-qc0-f182.google.com) (209.85.216.182) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 20 Jun 2014 11:00:02 +0000 Received: by mail-qc0-f182.google.com with SMTP id m20so3378256qcx.27 for ; Fri, 20 Jun 2014 03:59:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=scLNb6Va+CR2R7YfzVAU6FzcX9Hd1y0UZt1zJAEXqCE=; b=Fs/bRaJSFOc+aInsLi9oqiTWR2gQmIyYUeUo8tiZum9ZAudlIT8DkHcwgOfHfDgcDm SQbFQf9k2z7/hvFgL0PxTxDICqdRIQm0Qy8iHQmJMY/nzaeiekOS6bCdAI85zgWEA3AO kjJfdTa9ZxYxZYuhKlzE74mqLMsyxOqhXVQRyjcZ0U+yTm0TXiRwBKM6Ll6mxya3yr5G VQGR/xO30a2LzKnlgMLEmNwRcyZhshAGhs0tuGo66xJjoi+v87N8X6+Cw8mX/cNmisc5 4xNeNqsHq8OH3CKlyUp0zbWvrOOyQiyAjj3d04e+fy97dJc60ytsUxIDlqOQ1k/Ch3Xw fkvQ== X-Gm-Message-State: ALoCoQn3ebQClUlH1C+SkcjgI6CmhwmXOMl1PbYGyQ42qNBx8FxPAP3C1qxaZ88FI6pT+fvFy67h MIME-Version: 1.0 X-Received: by 10.140.95.23 with SMTP id h23mr3333497qge.12.1403261980050; Fri, 20 Jun 2014 03:59:40 -0700 (PDT) Received: by 10.140.109.180 with HTTP; Fri, 20 Jun 2014 03:59:39 -0700 (PDT) In-Reply-To: References: Date: Fri, 20 Jun 2014 06:59:39 -0400 Message-ID: Subject: Re: Best way to do a multi_get using CQL From: "Laing, Michael" To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a11c15750d91db104fc426351 X-Virus-Checked: Checked by ClamAV on apache.org --001a11c15750d91db104fc426351 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable However my extensive benchmarking this week of the python driver from master shows a performance *decrease* when using 'token_aware'. This is on 12-node, 2-datacenter, RF-3 cluster in AWS. Also why do the work the coordinator will do for you: send all the queries, wait for everything to come back in whatever order, and sort the result. I would rather keep my app code simple. But the real point is that you should benchmark in your own environment. ml On Fri, Jun 20, 2014 at 3:29 AM, Marcelo Elias Del Valle < marcelo@s1mbi0se.com.br> wrote: > Yes, I am using the CQL datastax drivers. > It was a good advice, thanks a lot Janathan. > []s > > > 2014-06-20 0:28 GMT-03:00 Jonathan Haddad : > > The only case in which it might be better to use an IN clause is if >> the entire query can be satisfied from that machine. Otherwise, go >> async. >> >> The native driver reuses connections and intelligently manages the >> pool for you. It can also multiplex queries over a single connection. >> >> I am assuming you're using one of the datastax drivers for CQL, btw. >> >> Jon >> >> On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle >> wrote: >> > This is interesting, I didn't know that! >> > It might make sense then to use select =3D + async + token aware, I wi= ll >> try >> > to change my code. >> > >> > But would it be a "recomended solution" for these cases? Any other >> options? >> > >> > I still would if this is the right use case for Cassandra, to look for >> > random keys in a huge cluster. After all, the amount of connections to >> > Cassandra will still be huge, right... Wouldn't it be a problem? >> > Or when you use async the driver reuses the connection? >> > >> > []s >> > >> > >> > 2014-06-19 22:16 GMT-03:00 Jonathan Haddad : >> > >> >> If you use async and your driver is token aware, it will go to the >> >> proper node, rather than requiring the coordinator to do so. >> >> >> >> Realistically you're going to have a connection open to every server >> >> anyways. It's the difference between you querying for the data >> >> directly and using a coordinator as a proxy. It's faster to just ask >> >> the node with the data. >> >> >> >> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle >> >> wrote: >> >> > But using async queries wouldn't be even worse than using SELECT IN= ? >> >> > The justification in the docs is I could query many nodes, but I >> would >> >> > still >> >> > do it. >> >> > >> >> > Today, I use both async queries AND SELECT IN: >> >> > >> >> > SELECT_ENTITY_LOOKUP =3D "SELECT entity_id FROM " + ENTITY_LOOKUP += " >> >> > WHERE >> >> > name=3D%s and value in(%s)" >> >> > >> >> > for name, values in identifiers.items(): >> >> > query =3D self.SELECT_ENTITY_LOOKUP % ('%s', >> >> > ','.join(['%s']*len(values))) >> >> > args =3D [name] + values >> >> > query_msg =3D query % tuple(args) >> >> > futures.append((query_msg, self.session.execute_async(query, >> args))) >> >> > >> >> > for query_msg, future in futures: >> >> > try: >> >> > rows =3D future.result(timeout=3D100000) >> >> > for row in rows: >> >> > entity_ids.add(row.entity_id) >> >> > except: >> >> > logging.error("Query '%s' returned ERROR " % (query_msg)) >> >> > raise >> >> > >> >> > Using async just with select =3D would mean instead of 1 async quer= y >> >> > (example: >> >> > in (0, 1, 2)), I would do several, one for each value of "values" >> array >> >> > above. >> >> > In my head, this would mean more connections to Cassandra and the >> same >> >> > amount of work, right? What would be the advantage? >> >> > >> >> > []s >> >> > >> >> > >> >> > >> >> > >> >> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad : >> >> > >> >> >> Your other option is to fire off async queries. It's pretty >> >> >> straightforward w/ the java or python drivers. >> >> >> >> >> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle >> >> >> wrote: >> >> >> > I was taking a look at Cassandra anti-patterns list: >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> http://www.datastax.com/documentation/cassandra/2.0/cassandra/architectu= re/architecturePlanningAntiPatterns_c.html >> >> >> > >> >> >> > Among then is >> >> >> > >> >> >> > SELECT ... IN or index lookups=C2=B6 >> >> >> > >> >> >> > SELECT ... IN and index lookups (formerly secondary indexes) >> should >> >> >> > be >> >> >> > avoided except for specific scenarios. See When not to use IN in >> >> >> > SELECT >> >> >> > and >> >> >> > When not to use an index in Indexing in >> >> >> > >> >> >> > CQL for Cassandra 2.0" >> >> >> > >> >> >> > And Looking at the SELECT doc, I saw: >> >> >> > >> >> >> > When not to use IN=C2=B6 >> >> >> > >> >> >> > The recommendations about when not to use an index apply to usin= g >> IN >> >> >> > in >> >> >> > the >> >> >> > WHERE clause. Under most conditions, using IN in the WHERE claus= e >> is >> >> >> > not >> >> >> > recommended. Using IN can degrade performance because usually ma= ny >> >> >> > nodes >> >> >> > must be queried. For example, in a single, local data center >> cluster >> >> >> > having >> >> >> > 30 nodes, a replication factor of 3, and a consistency level of >> >> >> > LOCAL_QUORUM, a single key query goes out to two nodes, but if t= he >> >> >> > query >> >> >> > uses the IN condition, the number of nodes being queried are mos= t >> >> >> > likely >> >> >> > even higher, up to 20 nodes depending on where the keys fall in >> the >> >> >> > token >> >> >> > range." >> >> >> > >> >> >> > In my system, I have a column family called "entity_lookup": >> >> >> > >> >> >> > CREATE KEYSPACE IF NOT EXISTS Identification1 >> >> >> > WITH REPLICATION =3D { 'class' : 'NetworkTopologyStrategy', >> >> >> > 'DC1' : 3 }; >> >> >> > USE Identification1; >> >> >> > >> >> >> > CREATE TABLE IF NOT EXISTS entity_lookup ( >> >> >> > name varchar, >> >> >> > value varchar, >> >> >> > entity_id uuid, >> >> >> > PRIMARY KEY ((name, value), entity_id)); >> >> >> > >> >> >> > And I use the following select to query it: >> >> >> > >> >> >> > SELECT entity_id FROM entity_lookup WHERE name=3D%s and value in= (%s) >> >> >> > >> >> >> > Is this an anti-pattern? >> >> >> > >> >> >> > If not using SELECT IN, which other way would you recomend for >> >> >> > lookups >> >> >> > like >> >> >> > that? I have several values I would like to search in cassandra >> and >> >> >> > they >> >> >> > might not be in the same particion, as above. >> >> >> > >> >> >> > Is Cassandra the wrong tool for lookups like that? >> >> >> > >> >> >> > Best regards, >> >> >> > Marcelo Valle. >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> > >> >> >> >> >> >> >> >> >> >> >> >> -- >> >> >> Jon Haddad >> >> >> http://www.rustyrazorblade.com >> >> >> skype: rustyrazorblade >> >> > >> >> > >> >> >> >> >> >> >> >> -- >> >> Jon Haddad >> >> http://www.rustyrazorblade.com >> >> skype: rustyrazorblade >> > >> > >> >> >> >> -- >> Jon Haddad >> http://www.rustyrazorblade.com >> skype: rustyrazorblade >> > > --001a11c15750d91db104fc426351 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
However my extensive benchmarking this week of the python = driver from master shows a performance decrease=C2=A0when using '= ;token_aware'.

This is on 12-node, 2-datacenter, RF-= 3 cluster in AWS.

Also why do the work the coordinator will do for you: s= end all the queries, wait for everything to come back in whatever order, an= d sort the result.

I would rather keep my app code= simple.

But the real point is that you should benchmark in your= own environment.

ml


On Fri, Jun 20, 2014 at 3:29 AM= , Marcelo Elias Del Valle <marcelo@s1mbi0se.com.br> wr= ote:
Yes, I am using the CQL dat= astax drivers.
It was a good advice, thanks a lot Janathan.
[= ]s


2014-06= -20 0:28 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com>:
=

The only case in which it might be better to= use an IN clause is if
the entire query can be satisfied from that machine. =C2=A0Otherwise, go async.

The native driver reuses connections and intelligently manages the
pool for you. =C2=A0It can also multiplex queries over a single connection.=

I am assuming you're using one of the datastax drivers for CQL, btw.
Jon

On Thu, Jun 19, 2014 at 7:37 PM, Marcelo Elias Del Valle
<= marcelo@s1mbi0se.com.br> wrote:
> This is interesting, I didn't know that!
> It might make sense then to use select =3D + async + token aware, I wi= ll try
> to change my code.
>
> But would it be a "recomended solution" for these cases? Any= other options?
>
> I still would if this is the right use case for Cassandra, to look for=
> random keys in a huge cluster. After all, the amount of connections to=
> Cassandra will still be huge, right... Wouldn't it be a problem? > Or when you use async the driver reuses the connection?
>
> []s
>
>
> 2014-06-19 22:16 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com>:
>
>> If you use async and your driver is token aware, it will go to the=
>> proper node, rather than requiring the coordinator to do so.
>>
>> Realistically you're going to have a connection open to every = server
>> anyways. =C2=A0It's the difference between you querying for th= e data
>> directly and using a coordinator as a proxy. =C2=A0It's faster= to just ask
>> the node with the data.
>>
>> On Thu, Jun 19, 2014 at 6:11 PM, Marcelo Elias Del Valle
>> <m= arcelo@s1mbi0se.com.br> wrote:
>> > But using async queries wouldn't be even worse than using= SELECT IN?
>> > The justification in the docs is I could query many nodes, bu= t I would
>> > still
>> > do it.
>> >
>> > Today, I use both async queries AND SELECT IN:
>> >
>> > SELECT_ENTITY_LOOKUP =3D "SELECT entity_id FROM " += ENTITY_LOOKUP + "
>> > WHERE
>> > name=3D%s and value in(%s)"
>> >
>> > for name, values in identifiers.items():
>> > =C2=A0 =C2=A0query =3D self.SELECT_ENTITY_LOOKUP % ('%s&#= 39;,
>> > ','.join(['%s']*len(values)))
>> > =C2=A0 =C2=A0args =3D [name] + values
>> > =C2=A0 =C2=A0query_msg =3D query % tuple(args)
>> > =C2=A0 =C2=A0futures.append((query_msg, self.session.execute_= async(query, args)))
>> >
>> > for query_msg, future in futures:
>> > =C2=A0 =C2=A0try:
>> > =C2=A0 =C2=A0 =C2=A0 rows =3D future.result(timeout=3D100000)=
>> > =C2=A0 =C2=A0 =C2=A0 for row in rows:
>> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 entity_ids.add(row.entity_id)
>> > =C2=A0 =C2=A0except:
>> > =C2=A0 =C2=A0 =C2=A0 logging.error("Query '%s' r= eturned ERROR " % (query_msg))
>> > =C2=A0 =C2=A0 =C2=A0 raise
>> >
>> > Using async just with select =3D would mean instead of 1 asyn= c query
>> > (example:
>> > in (0, 1, 2)), I would do several, one for each value of &quo= t;values" array
>> > above.
>> > In my head, this would mean more connections to Cassandra and= the same
>> > amount of work, right? What would be the advantage?
>> >
>> > []s
>> >
>> >
>> >
>> >
>> > 2014-06-19 22:01 GMT-03:00 Jonathan Haddad <jon@jonhaddad.com>:
>> >
>> >> Your other option is to fire off async queries. =C2=A0It&= #39;s pretty
>> >> straightforward w/ the java or python drivers.
>> >>
>> >> On Thu, Jun 19, 2014 at 5:56 PM, Marcelo Elias Del Valle<= br> >> >> <marcelo@s1mbi0se.com.br> wrote:
>> >> > I was taking a look at Cassandra anti-patterns list:=
>> >> >
>> >> >
>> >> >
>> >> > http://www.datastax.com/documentation/cassandra/2.0/cassa= ndra/architecture/architecturePlanningAntiPatterns_c.html
>> >> >
>> >> > Among then is
>> >> >
>> >> > SELECT ... IN or index lookups=C2=B6
>> >> >
>> >> > SELECT ... IN and index lookups (formerly secondary = indexes) should
>> >> > be
>> >> > avoided except for specific scenarios. See When not = to use IN in
>> >> > SELECT
>> >> > and
>> >> > When not to use an index in Indexing in
>> >> >
>> >> > CQL for Cassandra 2.0"
>> >> >
>> >> > And Looking at the SELECT doc, I saw:
>> >> >
>> >> > When not to use IN=C2=B6
>> >> >
>> >> > The recommendations about when not to use an index a= pply to using IN
>> >> > in
>> >> > the
>> >> > WHERE clause. Under most conditions, using IN in the= WHERE clause is
>> >> > not
>> >> > recommended. Using IN can degrade performance becaus= e usually many
>> >> > nodes
>> >> > must be queried. For example, in a single, local dat= a center cluster
>> >> > having
>> >> > 30 nodes, a replication factor of 3, and a consisten= cy level of
>> >> > LOCAL_QUORUM, a single key query goes out to two nod= es, but if the
>> >> > query
>> >> > uses the IN condition, the number of nodes being que= ried are most
>> >> > likely
>> >> > even higher, up to 20 nodes depending on where the k= eys fall in the
>> >> > token
>> >> > range."
>> >> >
>> >> > In my system, I have a column family called "en= tity_lookup":
>> >> >
>> >> > CREATE KEYSPACE IF NOT EXISTS Identification1
>> >> > =C2=A0 WITH REPLICATION =3D { 'class' : '= ;NetworkTopologyStrategy',
>> >> > =C2=A0 'DC1' : 3 };
>> >> > USE Identification1;
>> >> >
>> >> > CREATE TABLE IF NOT EXISTS entity_lookup (
>> >> > =C2=A0 name varchar,
>> >> > =C2=A0 value varchar,
>> >> > =C2=A0 entity_id uuid,
>> >> > =C2=A0 PRIMARY KEY ((name, value), entity_id));
>> >> >
>> >> > And I use the following select to query it:
>> >> >
>> >> > SELECT entity_id FROM entity_lookup WHERE name=3D%s = and value in(%s)
>> >> >
>> >> > Is this an anti-pattern?
>> >> >
>> >> > If not using SELECT IN, which other way would you re= comend for
>> >> > lookups
>> >> > like
>> >> > that? I have several values I would like to search i= n cassandra and
>> >> > they
>> >> > might not be in the same particion, as above.
>> >> >
>> >> > Is Cassandra the wrong tool for lookups like that? >> >> >
>> >> > Best regards,
>> >> > Marcelo Valle.
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Jon Haddad
>> >> http://www.rustyrazorblade.com
>> >> skype: rustyrazorblade
>> >
>> >
>>
>>
>>
>> --
>> Jon Haddad
>> http:= //www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>



--
Jon Haddad
http://www.rus= tyrazorblade.com
skype: rustyrazorblade


--001a11c15750d91db104fc426351--