cassandra-user mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Re: Data partitioning and composite partition key
Date Sat, 30 Aug 2014 00:23:02 GMT
Okay, but what benefit do you think you get from having the partitions on the same node, since
they would be separate partitions anyway? What exactly do you think you’re going to do with
them that wouldn’t be a whole lot more performant if you could process the data in parallel
from separate nodes? I mean, the whole point of Cassandra is scalability and distributed
processing, right?

-- Jack Krupansky

From: Drew Kutcharian 
Sent: Friday, August 29, 2014 7:31 PM
To: user@cassandra.apache.org 
Subject: Re: Data partitioning and composite partition key

Hi Jack, 

I think you missed the point of my email, which was trying to avoid the problem of having very
wide rows :)  In the notation sensorId-datetime, the datetime part is a bucket, say
a day. The CQL rows would still be keyed by the actual time of the event. So you’d end up
having SensorId -> Datetime Bucket (day/week/month) -> actual event. What I wanted to be
able to do was to colocate all the events related to a sensor id on a single node (token).

See "High Throughput Timelines” at http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra

- Drew


On Aug 29, 2014, at 3:58 PM, Jack Krupansky <jack@basetechnology.com> wrote:


  With CQL3, you, the developer, get to decide whether to place a primary key column in the
partition key or use it as a clustering column. So, make sensorID the partition key and datetime
a clustering column.
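
  For example, something like this (just a sketch, column names made up):

  CREATE TABLE sensor_events (
    sensorID timeuuid,
    datetime timestamp,
    value    text,
    PRIMARY KEY (sensorID, datetime)   // sensorID = partition key, datetime = clustering column
  );

  All events for a sensor then live in one partition, ordered by datetime.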

  -- Jack Krupansky

  From: Drew Kutcharian 
  Sent: Friday, August 29, 2014 6:48 PM
  To: user@cassandra.apache.org 
  Subject: Data partitioning and composite partition key

  Hey Guys, 

  AFAIK, Cassandra currently partitions (thrift) rows using the row key; basically it uses
hash(row_key) to decide which node a row is stored on. Now there are times when
there is a need to shard a wide row, say when storing events per sensor, so you’d use a sensorId-datetime
row key to avoid ending up with very large rows. Is there a way to have the partitioner
use only the “sensorId” part of the row key for the hash? That way we would be able to
store all the data relating to a sensor on one node.
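
  For example, with a bucketed row key the two statements below would return different tokens,
so the two buckets will usually live on different nodes (table name and key values are made up;
token() just shows the placement token):

  SELECT token(key) FROM events WHERE key = 'sensor1:2014-08-29';
  SELECT token(key) FROM events WHERE key = 'sensor1:2014-08-30';

  What I’d like is placement driven by the token of the 'sensor1' part alone.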

  Another use case of this would be multi-tenancy:

  Say we have accounts and accounts have users. So we would have the following tables:

  CREATE TABLE account (
    id        timeuuid PRIMARY KEY,
    company   text      // timezone
  );

  CREATE TABLE user (
    id        timeuuid PRIMARY KEY,
    accountId timeuuid,
    email     text,
    password  text
  );

  // Get users by account
  CREATE TABLE user_account_index (
    accountId timeuuid,
    userId    timeuuid,
    PRIMARY KEY (accountId, userId)
  );

  Say I want to get all the users that belong to an account. I would first have to get the
results from user_account_index and then use a multi-get (WHERE IN) to get the records from
the user table. Now this multi-get part could potentially query a lot of different nodes in the
cluster. It’d be great if there were a way to limit storage of an account’s users to a single
node, so that the multi-get would only need to query a single node.
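
  In other words, the read path looks roughly like this (a sketch; the ?s are bind parameters):

  // step 1: find the user ids for the account
  SELECT userId FROM user_account_index WHERE accountId = ?;

  // step 2: multi-get the user rows using the ids from step 1;
  // the IN list can span many partitions, and therefore many nodes
  SELECT * FROM user WHERE id IN (?, ?, ?);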

  Note that the problem cannot simply be fixed by using (accountId, id) as the primary key
for the user table, since that would create the problem of having very wide (thrift)
rows in the users table.
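
  That alternative would look something like this (sketch):

  CREATE TABLE user (
    accountId timeuuid,
    id        timeuuid,
    email     text,
    password  text,
    PRIMARY KEY (accountId, id)   // accountId alone becomes the partition key
  );

  Every user of an account would then live in that account’s single partition, which is the
wide-row problem all over again.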

  I did look through the code and JIRA and I couldn’t really find a solution. The closest I
got was to use a custom partitioner, but you can’t have a partitioner per keyspace,
and that’s not even something that will be implemented in the future, based on the following JIRA:
  https://issues.apache.org/jira/browse/CASSANDRA-295

  Any ideas are much appreciated.

  Best,

  Drew
