From: Avi Levi <avi@indeni.com>
Date: Tue, 22 Aug 2017 09:45:36 +0300
Subject: Re: Getting all unique keys
To: user@cassandra.apache.org
boundary="001a11c1868c5b9c1d055751f142" --001a11c1868c5b9c1d055751f142 Content-Type: text/plain; charset="UTF-8" Thanks Christophe, we will definitely consider that in the future. On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz < christophe@instaclustr.com> wrote: > Hi Avi, > > The spark-project documentation is quite good, as well as the > spark-cassandra-connector github project, which contains some basic > examples you can easily get inspired from. A few random advice you might > find usefull: > - You will want one spark worker on each node, and a spark master on > either one of the node, or on a separate node. > - Pay close attention at your port configuration (firewall) as the spark > error log does not always give you the right hint. > - Pay close attention at your heap size. Make sure to configure your heap > size such as Cassandra heap size + spark heap size < your node memory (take > into account Cassandra off heap usage if enabled, OS etc...) > - If your Cassandra data center is used in production, make sure you > throttle read / write from Spark, pay attention to your latencies, and > consider using a separate analytic cassandra data center if you get serious > with Spark. > - More or less everyone I know find that writing spark jobs in scala is > natural, while writing them in java is painful :D > > Getting spark running will be a bit of an investment at the beginning, but > overall you will find out it allows you to run queries you can't naturally > run in Cassandra, like the one you described. > > Cheers, > > Christophe > > On 21 August 2017 at 16:16, Avi Levi wrote: > >> Thanks Christophe, >> we didn't want to add too many moving parts but is sound like a good >> solution. do you have any reference / link that I can look at ? >> >> Cheers >> Avi >> >> On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz < >> christophe@instaclustr.com> wrote: >> >>> Hi Avi, >>> >>> Have you thought of using Spark for that work? If you collocate the >>> spark workers on each Cassandra nodes, the spark-cassandra connector will >>> split automatically the token range for you in such a way that each spark >>> worker only hit the Cassandra local node. This will also be done in >>> parallel. Should be much faster that way. >>> >>> Cheers, >>> Christophe >>> >>> >>> On 21 August 2017 at 01:34, Avi Levi wrote: >>> >>>> Thank you very much , one question . you wrote that I do not need >>>> distinct here since it's a part from the primary key. but only the >>>> combination is unique (*PRIMARY KEY (id, timestamp) ) .* also if I >>>> take the last token and feed it back as you showed wouldn't I get >>>> overlapping boundaries ? >>>> >>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens >>>> wrote: >>>> >>>>> You should be able to fairly efficiently iterate all the partition >>>>> keys like: >>>>> >>>>> select id, token(id) from table where token(id) >= >>>>> -9204925292781066255 limit 1000; >>>>> id | system.token(id) >>>>> --------------------------------------------+---------------------- >>>>> ... >>>>> 0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686 >>>>> >>>>> Take the last token you receive and feed it back in, skipping >>>>> duplicates from the previous page (on the unlikely chance that you have two >>>>> ID's with a token collision on the page boundary): >>>>> >>>>> select id, token(id) from table where token(id) >= >>>>> -7821793584824523686 limit 1000; >>>>> id | system.token(id) >>>>> --------------------------------------------+--------------------- >>>>> ... 
>>> On 21 August 2017 at 01:34, Avi Levi <avi@indeni.com> wrote:
>>>
>>>> Thank you very much, one question: you wrote that I do not need DISTINCT here since it's part of the primary key, but only the combination is unique (*PRIMARY KEY (id, timestamp)*). Also, if I take the last token and feed it back as you showed, wouldn't I get overlapping boundaries?
>>>>
>>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <mightye@gmail.com> wrote:
>>>>
>>>>> You should be able to fairly efficiently iterate all the partition keys like:
>>>>>
>>>>> select id, token(id) from table where token(id) >= -9204925292781066255 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+----------------------
>>>>> ...
>>>>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>>>
>>>>> Take the last token you receive and feed it back in, skipping duplicates from the previous page (on the unlikely chance that you have two IDs with a token collision on the page boundary):
>>>>>
>>>>> select id, token(id) from table where token(id) >= -7821793584824523686 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+---------------------
>>>>> ...
>>>>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>>>
>>>>> Continue until you have no more results. You don't really need DISTINCT here: it's part of your primary key, so it must already be distinct.
>>>>>
>>>>> If you want to parallelize it, split the ring into *n* ranges and include each range's upper bound in its query:
>>>>>
>>>>> select id, token(id) from table where token(id) >= -9204925292781066255 AND token(id) < $rangeUpperBound limit 1000;
>>>>>
>>>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <avi@indeni.com> wrote:
>>>>>
>>>>>> I need to get all unique keys (not the complete primary key, just the partition key) in order to aggregate all the relevant records of that key and apply some calculations on them.
>>>>>>
>>>>>> CREATE TABLE my_table (
>>>>>>     id text,
>>>>>>     timestamp bigint,
>>>>>>     value double,
>>>>>>     PRIMARY KEY (id, timestamp)
>>>>>> );
>>>>>>
>>>>>> I know that a query like this
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table;
>>>>>>
>>>>>> is not very efficient, but how about the approach presented here, sending queries in parallel and using the token:
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808 AND token(id) <= -9204925292781066255;
>>>>>>
>>>>>> Or I can just maintain another table with only the unique keys:
>>>>>>
>>>>>> CREATE TABLE id_only (
>>>>>>     id text,
>>>>>>     PRIMARY KEY (id)
>>>>>> );
>>>>>>
>>>>>> but I tend not to, since it is error prone and would force other procedures to maintain data integrity between those two tables.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks
>>>>>> Avi
>>>
>>> --
>>> *Christophe Schmitz*
>>> *Director of consulting EMEA*
>
> --
> *Christophe Schmitz*
> *Director of consulting EMEA*
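For completeness, a rough, untested sketch of the paging loop Eric describes, driven from Scala with the DataStax Java driver (the keyspace, contact host, and object name are illustrative). Per Avi's follow-up, DISTINCT is kept so each partition key comes back once rather than once per (id, timestamp) row, and the overlap created by feeding the last token back in with >= is absorbed by accumulating into a set:

    import com.datastax.driver.core.Cluster

    import scala.annotation.tailrec
    import scala.collection.JavaConverters._

    object AllPartitionKeys {
      val PageSize = 1000

      def main(args: Array[String]): Unit = {
        val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
        val session = cluster.connect("my_keyspace")

        @tailrec
        def page(from: Long, acc: Set[String]): Set[String] = {
          val rows = session.execute(
            s"SELECT DISTINCT id, token(id) FROM my_table WHERE token(id) >= $from LIMIT $PageSize"
          ).all().asScala.toVector
          // The set absorbs any rows re-read at the page boundary, where the
          // next page restarts on the previous page's last token.
          val ids = acc ++ rows.map(_.getString("id"))
          if (rows.size < PageSize) ids          // short page: ring exhausted
          else page(rows.last.getLong(1), ids)   // feed the last token back in
        }

        // Long.MinValue is the smallest Murmur3 token, so this walks the whole ring.
        val ids = page(Long.MinValue, Set.empty)
        println(s"unique ids: ${ids.size}")

        cluster.close()
      }
    }

Parallelizing it is then just running the same loop over *n* sub-ranges, adding the upper-bound predicate Eric showed to each query.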