cassandra-user mailing list archives

From Avi Levi <...@indeni.com>
Subject Re: Getting all unique keys
Date Tue, 22 Aug 2017 06:45:36 GMT
Thanks Christophe, we will definitely consider that in the future.

On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz <
christophe@instaclustr.com> wrote:

> Hi Avi,
>
> The Spark project documentation is quite good, as is the
> spark-cassandra-connector GitHub project, which contains some basic
> examples you can easily take inspiration from. A few random pieces of
> advice you might find useful:
> - You will want one Spark worker on each node, and a Spark master on
> either one of the nodes or on a separate node.
> - Pay close attention to your port configuration (firewall), as the Spark
> error log does not always give you the right hint.
> - Pay close attention to your heap sizes. Make sure to configure them
> such that Cassandra heap size + Spark heap size < your node memory (taking
> into account Cassandra off-heap usage if enabled, the OS, etc.).
> - If your Cassandra data center is used in production, make sure you
> throttle reads/writes from Spark, pay attention to your latencies, and
> consider using a separate analytics Cassandra data center if you get
> serious with Spark.
> - More or less everyone I know finds that writing Spark jobs in Scala is
> natural, while writing them in Java is painful :D
>
> Getting Spark running will be a bit of an investment at the beginning,
> but overall you will find it allows you to run queries you can't
> naturally run in Cassandra, like the one you described.
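>
> For example, here is a minimal sketch of such a job with the connector
> (the keyspace name, contact point, and memory setting are assumptions,
> not from this thread):
>
> import com.datastax.spark.connector._
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("unique-keys")
>   .set("spark.cassandra.connection.host", "10.0.0.1") // any Cassandra node (assumption)
>   .set("spark.executor.memory", "8g") // keep Cassandra heap + Spark heap < node RAM
> val sc = new SparkContext(conf)
>
> // The connector splits the scan by token range; each range is read by a
> // worker colocated with the node that owns it, so this runs in parallel.
> val ids = sc.cassandraTable("my_keyspace", "my_table")
>   .select("id")
>   .map(_.getString("id"))
>   .distinct()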
>
> Cheers,
>
> Christophe
>
> On 21 August 2017 at 16:16, Avi Levi <avi@indeni.com> wrote:
>
>> Thanks Christophe,
>> We didn't want to add too many moving parts, but it sounds like a good
>> solution. Do you have any reference / link that I can look at?
>>
>> Cheers
>> Avi
>>
>> On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz <
>> christophe@instaclustr.com> wrote:
>>
>>> Hi Avi,
>>>
>>> Have you thought of using Spark for that work? If you colocate the
>>> Spark workers on the Cassandra nodes, the spark-cassandra-connector
>>> will automatically split the token range for you in such a way that
>>> each Spark worker only hits its local Cassandra node. This is also done
>>> in parallel, so it should be much faster that way.
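>>>
>>> As a rough sketch of the shape this takes (the keyspace name is an
>>> assumption, and the aggregation is a placeholder for your calculation):
>>>
>>> import com.datastax.spark.connector._
>>> import org.apache.spark.{SparkConf, SparkContext}
>>>
>>> val conf = new SparkConf()
>>>   .setAppName("per-key-aggregation")
>>>   .set("spark.cassandra.connection.host", "127.0.0.1") // assumption
>>> val sc = new SparkContext(conf)
>>>
>>> // spanBy groups the rows of each partition key without a shuffle,
>>> // since the connector reads a partition's rows contiguously.
>>> val avgPerId = sc.cassandraTable("my_keyspace", "my_table")
>>>   .spanBy(row => row.getString("id"))
>>>   .map { case (id, rows) =>
>>>     val values = rows.map(_.getDouble("value"))
>>>     (id, values.sum / values.size) // placeholder calculation
>>>   }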
>>>
>>> Cheers,
>>> Christophe
>>>
>>>
>>> On 21 August 2017 at 01:34, Avi Levi <avi@indeni.com> wrote:
>>>
>>>> Thank you very much. One question: you wrote that I do not need
>>>> DISTINCT here since it's part of the primary key, but only the
>>>> combination is unique (PRIMARY KEY (id, timestamp)). Also, if I take
>>>> the last token and feed it back as you showed, wouldn't I get
>>>> overlapping boundaries?
>>>>
>>>> On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens <mightye@gmail.com>
>>>> wrote:
>>>>
>>>>> You should be able to fairly efficiently iterate all the partition
>>>>> keys like:
>>>>>
>>>>> select id, token(id) from table where token(id) >=
>>>>> -9204925292781066255 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+----------------------
>>>>> ...
>>>>>  0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686
>>>>>
>>>>> Take the last token you receive and feed it back in, skipping
>>>>> duplicates from the previous page (on the unlikely chance that you
>>>>> have two IDs with a token collision on the page boundary):
>>>>>
>>>>> select id, token(id) from table where token(id) >=
>>>>> -7821793584824523686 limit 1000;
>>>>>  id                                         | system.token(id)
>>>>> --------------------------------------------+---------------------
>>>>> ...
>>>>>  0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339
>>>>>
>>>>> Continue until you have no more results. You don't really need
>>>>> DISTINCT here: it's part of your primary key, so it must already be
>>>>> distinct.
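>>>>>
>>>>> In application code the loop might look roughly like this (Java
>>>>> driver 3.x from Scala; the contact point and keyspace name are
>>>>> assumptions, and id is read as text per your schema):
>>>>>
>>>>> import com.datastax.driver.core.Cluster
>>>>> import scala.collection.JavaConverters._
>>>>>
>>>>> val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
>>>>> val session = cluster.connect("my_keyspace") // keyspace is an assumption
>>>>>
>>>>> var lastToken   = Long.MinValue
>>>>> var boundaryIds = Set.empty[String] // ids already seen at lastToken
>>>>> var done = false
>>>>> while (!done) {
>>>>>   // assumes fewer than 1000 ids ever share a single token value
>>>>>   val rows = session.execute(
>>>>>     s"SELECT id, token(id) FROM my_table WHERE token(id) >= $lastToken LIMIT 1000"
>>>>>   ).all().asScala.toList
>>>>>   // drop rows already emitted on the previous page boundary
>>>>>   val fresh = rows.filterNot(r =>
>>>>>     r.getLong(1) == lastToken && boundaryIds.contains(r.getString(0)))
>>>>>   fresh.foreach(r => println(r.getString(0))) // process each id
>>>>>   if (rows.size < 1000) done = true
>>>>>   else {
>>>>>     lastToken   = rows.last.getLong(1)
>>>>>     boundaryIds = rows.filter(_.getLong(1) == lastToken).map(_.getString(0)).toSet
>>>>>   }
>>>>> }
>>>>> cluster.close()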
>>>>>
>>>>> If you want to parallelize it, split the ring into *n* ranges and
>>>>> include each range's end as an upper bound for its segment:
>>>>>
>>>>> select id, token(id) from table where token(id) >=
>>>>> -9204925292781066255 AND token(id) < $rangeUpperBound limit 1000;
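>>>>>
>>>>> Computing those bounds is straightforward, assuming the default
>>>>> Murmur3Partitioner (whose tokens cover the full signed 64-bit
>>>>> range); a sketch:
>>>>>
>>>>> // Split the token ring into n contiguous [lower, upper] ranges.
>>>>> def tokenRanges(n: Int): Seq[(Long, Long)] = {
>>>>>   val span = BigInt(Long.MaxValue) - BigInt(Long.MinValue) + 1
>>>>>   val step = span / n
>>>>>   (0 until n).map { i =>
>>>>>     val lo = BigInt(Long.MinValue) + step * i
>>>>>     val hi = if (i == n - 1) BigInt(Long.MaxValue)
>>>>>              else BigInt(Long.MinValue) + step * (i + 1) - 1
>>>>>     (lo.toLong, hi.toLong)
>>>>>   }
>>>>> }
>>>>> // Each (lo, hi) pair can then be scanned independently:
>>>>> // ... WHERE token(id) >= lo AND token(id) <= hi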
>>>>>
>>>>>
>>>>> On Sun, Aug 20, 2017 at 12:33 AM Avi Levi <avi@indeni.com> wrote:
>>>>>
>>>>>> I need to get all unique keys (not the complete primary key, just
>>>>>> the partition key) in order to aggregate all the relevant records of
>>>>>> that key and apply some calculations to them.
>>>>>>
>>>>>> CREATE TABLE my_table (
>>>>>>     id text,
>>>>>>     timestamp bigint,
>>>>>>     value double,
>>>>>>     PRIMARY KEY (id, timestamp)
>>>>>> );
>>>>>>
>>>>>> I know that a query like this:
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table;
>>>>>>
>>>>>> is not very efficient, but how about the approach presented here
>>>>>> <http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/>,
>>>>>> sending queries in parallel and using the token:
>>>>>>
>>>>>> SELECT DISTINCT id FROM my_table WHERE
>>>>>> token(id) >= -9223372036854775808 AND token(id) <= -9204925292781066255;
>>>>>>
>>>>>> Or I could just maintain another table with only the unique keys:
>>>>>>
>>>>>> CREATE TABLE id_only (
>>>>>>     id text,
>>>>>>     PRIMARY KEY (id)
>>>>>> );
>>>>>>
>>>>>> but I tend not to, since it is error prone and will force other
>>>>>> procedures to maintain data integrity between those two tables.
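>>>>>>
>>>>>> Keeping them in sync would mean every write path updates both
>>>>>> tables, e.g. with a logged batch (a sketch; the bind markers are
>>>>>> placeholders):
>>>>>>
>>>>>> BEGIN BATCH
>>>>>>   INSERT INTO my_table (id, timestamp, value) VALUES (?, ?, ?);
>>>>>>   INSERT INTO id_only (id) VALUES (?);
>>>>>> APPLY BATCH;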
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Avi
>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>>
>>> *Christophe Schmitz*
>>> *Director of consulting EMEA*
>>>
>>
>>
>
>
> --
>
>
> *Christophe Schmitz*
> *Director of consulting EMEA*
>
