cassandra-user mailing list archives

From srungarapu vamsi <srungarapu1...@gmail.com>
Subject Re: Need Column Family Schema Suggestion
Date Wed, 27 Jan 2016 06:33:35 GMT
Jack,
This is one of the analytics jobs I have to run. For this particular problem, I
want to optimize the schema so that, instead of loading the data as an RDD onto
the Spark machines, I can get the number directly from Cassandra queries.
The rationale is that I want to save on Spark machine types :)

On Wed, 27 Jan 2016 at 02:07 Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> Step 1 in data modeling in Cassandra is to define all of your queries. Are
> these in fact the ONLY queries that you need?
>
> If you are doing significant analytics, Spark is indeed the way to go.
>
> Cassandra works best for point queries and narrow slice queries (sequence
> of consecutive rows within a single partition).
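> For example, against a hypothetical table of my own, just to illustrate:
>
> CREATE TABLE readings (
>     sensor_id text,
>     event_time timestamp,
>     value double,
>     PRIMARY KEY (sensor_id, event_time)
> );
>
> -- point query: one row, identified by the full primary key
> SELECT * FROM readings
> WHERE sensor_id = 's1' AND event_time = '2016-01-26 10:00:00';
>
> -- narrow slice: consecutive clustering rows within one partition
> SELECT * FROM readings
> WHERE sensor_id = 's1'
>   AND event_time >= '2016-01-26 10:00:00'
>   AND event_time <  '2016-01-26 11:00:00';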
>
> -- Jack Krupansky
>
> On Tue, Jan 26, 2016 at 4:46 AM, srungarapu vamsi <
> srungarapu1989@gmail.com> wrote:
>
>> Hi,
>> I have the following use case:
>> A product (P) has 3 or more devices associated with it. Each device (Di)
>> emits a set of names (the set has at most 250 elements) every
>> minute.
>> Now the ask is: compute the function foo(product, hour), which is defined as
>> follows:
>> *foo*(*product*, *hour*) = the number of names that are seen by all of the
>> devices associated with the given *product* in the given *hour*.
>> Example:
>> Let's say product p1 has devices d1, d2, d3 associated with it.
>> Let's say S(d,h) is the *set* of names seen by device d in hour h.
>> So, now foo(p1,h) = length(S(d1,h) intersect S(d2,h) intersect S(d3,h))
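>> For instance, if S(d1,h) = {a,b,c}, S(d2,h) = {b,c,d} and S(d3,h) = {b,c,e},
>> then foo(p1,h) = length({b,c}) = 2.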
>>
>> I came up with the following approaches but I am not convinced by them:
>> Approach A.
>> Create a column family with the following schema:
>> column family name : hour_data
>> hour_data(hour,product,name,device_id_set)
>> device_id_set is Set<String>
>> Primary Key: (hour,product,name)
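>> Roughly, in CQL (the column types are only what I am assuming; hour could
>> just as well be text):
>>
>> CREATE TABLE hour_data (
>>     hour timestamp,
>>     product text,
>>     name text,
>>     device_id_set set<text>,
>>     PRIMARY KEY (hour, product, name)
>> );
>>
>> -- every minute, for each name a device sees:
>> UPDATE hour_data SET device_id_set = device_id_set + {'d1'}
>> WHERE hour = '2016-01-26 10:00:00' AND product = 'p1' AND name = 'n1';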
>> *Issue*:
>> I can't just run a query like SELECT COUNT(*) FROM hour_data where
>> hour=<h> and product=p and length(device_id_set)=3 as querying on
>> collections is not possible
>>
>> Approach B.
>> Create a column family with the following schema:
>> column family name : hour_data
>> hour_data(hour,product,name,num_devices_counter)
>> num_devices_counter is counter
>> Primary Key: (hour,product,name)
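>> Roughly, in CQL (again, types are my assumption; since this is a counter
>> table, all non-key columns have to be counters, which works out here):
>>
>> CREATE TABLE hour_data (
>>     hour timestamp,
>>     product text,
>>     name text,
>>     num_devices_counter counter,
>>     PRIMARY KEY (hour, product, name)
>> );
>>
>> -- once per device per (hour, product, name):
>> UPDATE hour_data SET num_devices_counter = num_devices_counter + 1
>> WHERE hour = '2016-01-26 10:00:00' AND product = 'p1' AND name = 'n1';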
>> *Issue*:
>> I can't just run a query like SELECT COUNT(*) FROM hour_data where
>> hour=<h> and product=p and num_devices_counter=3 as restricting a counter
>> column in the WHERE clause is not possible
>>
>> Approach C.
>> Column family schema:
>> hour_data(hour,device,name)
>> Primary Key: (hour,device,name)
>> If we have to compute foo(p1,h), then read the data for every device from
>> *hour_data* and perform the intersection in Spark.
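>> Roughly, in CQL (types are my assumption):
>>
>> CREATE TABLE hour_data (
>>     hour timestamp,
>>     device text,
>>     name text,
>>     PRIMARY KEY (hour, device, name)
>> );
>>
>> -- one slice per device, results intersected in Spark:
>> SELECT name FROM hour_data
>> WHERE hour = '2016-01-26 10:00:00' AND device = 'd1';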
>> *Issue*:
>> This is a heavy operation and demands multiple large machines.
>>
>> Could you please help me refine these schemas or define a new schema
>> to solve my problem?
>>
>>
>
