From: Avi Levi <avi@indeni.com>
Date: Tue, 22 Aug 2017 09:45:36 +0300
Subject: Re: Getting all unique keys
To: user@cassandra.apache.org
boundary="001a11c1868c5b9c1d055751f142" --001a11c1868c5b9c1d055751f142 Content-Type: text/plain; charset="UTF-8" Thanks Christophe, we will definitely consider that in the future. On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz < christophe@instaclustr.com> wrote: > Hi Avi, > > The spark-project documentation is quite good, as well as the > spark-cassandra-connector github project, which contains some basic > examples you can easily get inspired from. A few random advice you might > find usefull: > - You will want one spark worker on each node, and a spark master on > either one of the node, or on a separate node. > - Pay close attention at your port configuration (firewall) as the spark > error log does not always give you the right hint. > - Pay close attention at your heap size. Make sure to configure your heap > size such as Cassandra heap size + spark heap size < your node memory (take > into account Cassandra off heap usage if enabled, OS etc...) > - If your Cassandra data center is used in production, make sure you > throttle read / write from Spark, pay attention to your latencies, and > consider using a separate analytic cassandra data center if you get serious > with Spark. > - More or less everyone I know find that writing spark jobs in scala is > natural, while writing them in java is painful :D > > Getting spark running will be a bit of an investment at the beginning, but > overall you will find out it allows you to run queries you can't naturally > run in Cassandra, like the one you described. > > Cheers, > > Christophe > > On 21 August 2017 at 16:16, Avi Levi wrote: > >> Thanks Christophe, >> we didn't want to add too many moving parts but is sound like a good >> solution. do you have any reference / link that I can look at ? >> >> Cheers >> Avi >> >> On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz < >> christophe@instaclustr.com> wrote: >> >>> Hi Avi, >>> >>> Have you thought of using Spark for that work? If you collocate the >>> spark workers on each Cassandra nodes, the spark-cassandra connector will >>> split automatically the token range for you in such a way that each spark >>> worker only hit the Cassandra local node. This will also be done in >>> parallel. Should be much faster that way. >>> >>> Cheers, >>> Christophe >>> >>> >>> On 21 August 2017 at 01:34, Avi Levi wrote: >>> >>>> Thank you very much , one question . you wrote that I do not need >>>> distinct here since it's a part from the primary key. but only the >>>> combination is unique (*PRIMARY KEY (id, timestamp) ) .* also if I >>>> take the last token and feed it back as you showed wouldn't I get >>>> overlapping boundaries ? >>>> >>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens >>>> wrote: >>>> >>>>> You should be able to fairly efficiently iterate all the partition >>>>> keys like: >>>>> >>>>> select id, token(id) from table where token(id) >= >>>>> -9204925292781066255 limit 1000; >>>>> id | system.token(id) >>>>> --------------------------------------------+---------------------- >>>>> ... >>>>> 0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686 >>>>> >>>>> Take the last token you receive and feed it back in, skipping >>>>> duplicates from the previous page (on the unlikely chance that you have two >>>>> ID's with a token collision on the page boundary): >>>>> >>>>> select id, token(id) from table where token(id) >= >>>>> -7821793584824523686 limit 1000; >>>>> id | system.token(id) >>>>> --------------------------------------------+--------------------- >>>>> ... 
>>> On 21 August 2017 at 01:34, Avi Levi <avi@indeni.com> wrote:
>>>
>>>> Thank you very much, one question: you wrote that I do not need DISTINCT here since it's part of the primary key, but only the combination is unique (*PRIMARY KEY (id, timestamp)*). Also, if I take the last token and feed it back as you showed, wouldn't I get overlapping boundaries?
>>>>
>>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <mightye@gmail.com> wrote:
>>>>
>>>>> You should be able to fairly efficiently iterate all the partition keys like:
>>>>>
>>>>> select id, token(id) from table where token(id) >= -9204925292781066255 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+----------------------
>>>>> ...
>>>>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>>>
>>>>> Take the last token you receive and feed it back in, skipping duplicates from the previous page (on the unlikely chance that you have two IDs with a token collision on the page boundary):
>>>>>
>>>>> select id, token(id) from table where token(id) >= -7821793584824523686 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+---------------------
>>>>> ...
>>>>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>>>
>>>>> Continue until you have no more results. You don't really need DISTINCT here: it's part of your primary key, so it must already be distinct.
>>>>>
>>>>> If you want to parallelize it, split the ring into *n* ranges and include each range's upper bound in its query:
>>>>>
>>>>> select id, token(id) from table where token(id) >= -9204925292781066255 AND token(id) < $rangeUpperBound limit 1000;
>>>>>
>>>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <avi@indeni.com> wrote:
>>>>>
>>>>>> I need to get all unique keys (not the complete primary key, just the partition key) in order to aggregate all the relevant records of that key and apply some calculations on them.
>>>>>>
>>>>>> CREATE TABLE my_table (
>>>>>>     id text,
>>>>>>     timestamp bigint,
>>>>>>     value double,
>>>>>>     PRIMARY KEY (id, timestamp)
>>>>>> );
>>>>>>
>>>>>> I know that a query like this
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table;
>>>>>>
>>>>>> is not very efficient, but how about the approach presented here, sending queries in parallel and using the token:
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808 AND token(id) <= -9204925292781066255;
>>>>>>
>>>>>> Or I can just maintain another table with only the unique keys:
>>>>>>
>>>>>> CREATE TABLE id_only (
>>>>>>     id text,
>>>>>>     PRIMARY KEY (id)
>>>>>> );
>>>>>>
>>>>>> but I tend not to, since it is error prone and would force other procedures to maintain data integrity between those two tables.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks
>>>>>> Avi
>>>
>>> --
>>> *Christophe Schmitz*
>>> *Director of consulting EMEA*
>
> --
> *Christophe Schmitz*
> *Director of consulting EMEA*
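For completeness, a rough, untested sketch of the paging loop Eric describes, driven from Scala with the DataStax Java driver (the keyspace, contact host, and object name are illustrative). Per Avi's follow-up, DISTINCT is kept so each partition key comes back once rather than once per (id, timestamp) row, and the overlap created by feeding the last token back in with >= is absorbed by accumulating into a set:

    import com.datastax.driver.core.Cluster

    import scala.annotation.tailrec
    import scala.collection.JavaConverters._

    object AllPartitionKeys {
      val PageSize = 1000

      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect("my_keyspace")

        @tailrec
        def page(from: Long, acc: Set[String]): Set[String] = {
          val rows = session.execute(
            s"SELECT DISTINCT id, token(id) FROM my_table WHERE token(id) >= $from LIMIT $PageSize"
          ).all().asScala.toVector
          // The set absorbs any rows re-read at the page boundary, where the
          // next page restarts on the previous page's last token.
          val ids = acc ++ rows.map(_.getString("id"))
          if (rows.size < PageSize) ids          // short page: ring exhausted
          else page(rows.last.getLong(1), ids)   // feed the last token back in
        }

        // Long.MinValue is the smallest Murmur3 token, so this walks the whole ring.
        val ids = page(Long.MinValue, Set.empty)
        println(s"unique ids: ${ids.size}")

        cluster.close()
      }
    }

Parallelizing it is then just running the same loop over *n* sub-ranges, adding the upper-bound predicate Eric showed to each query.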