cassandra-commits mailing list archives

From "Drew Kutcharian (JIRA)" <>
Subject [jira] [Comment Edited] (CASSANDRA-7850) Composite Aware Partitioner
Date Sat, 30 Aug 2014 02:01:54 GMT


Drew Kutcharian edited comment on CASSANDRA-7850 at 8/30/14 2:00 AM:

Yes, but then I might end up with very wide _thrift_ rows.

Basically what I want is {{PRIMARY KEY ((block_id, breed_bucket), breed)}} where records with
the same block_id are stored on the same node *regardless* of the value of breed_bucket. But I
don't want to use {{PRIMARY KEY (block_id, breed_bucket, breed)}} since in that case all the
records for a block_id would end up in a single _thrift_ row.

So, ideally the layout would be:
block_id -> decides the node
(block_id, breed_bucket) -> decides the _thrift_ row. Old school "row key"
breed -> prefix of _thrift_ columns. Old school "column name prefix"

was (Author: drew_kutchar):
Yes, but then I might end up with very wide rows.

Basically what I want is {{PRIMARY KEY ((block_id, breed_bucket), breed)}} where records with
same block_id and breed_bucket get stored on the same node, but in different _thrift_ rows
so I don't have very wide rows (millions of _thrift_ columns per _thrift_ row). 

> Composite Aware Partitioner
> ---------------------------
>                 Key: CASSANDRA-7850
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Drew Kutcharian
> Since C* supports composites for partition keys, I think it'd be useful to have the ability
to use only the first (or first few) components of the key to calculate the token hash.
> A naive use case would be multi-tenancy:
> Say we have accounts and accounts have users. So we would have the following tables:
> {code}
> CREATE TABLE account (
>   id                     timeuuid PRIMARY KEY,
>   company         text
> );
> {code}
> {code}
> CREATE TABLE user (
>   id              timeuuid PRIMARY KEY,
>   accountId timeuuid,
>   email        text,
>   password text
> );
> {code}
> {code}
> // Get users by account
> CREATE TABLE user_account_index (
>   accountId  timeuuid,
>   userId        timeuuid,
>   PRIMARY KEY(accountId, userId)
> );
> {code}
> Say we want to get all the users that belong to an account. We would first have to get
the results from user_account_index and then use a multi-get (WHERE IN) to get the records
from the user table. This multi-get could potentially query many different nodes
in the cluster. It’d be great if there were a way to store all users of an account
on a single node, so that the multi-get would only need to query that node.
> With this improvement we would be able to define the user table like so:
> {code}
> CREATE TABLE user (
>   id              timeuuid,
>   accountId timeuuid,
>   email        text,
>   password text,
>   PRIMARY KEY(((accountId),id))  //extra parentheses
> );
> {code}
> I'm not too sure about the notation; it could be something like PRIMARY KEY(((accountId),id)),
where the "(accountId)" means use this part to calculate the hash and ((accountId),id) is
the actual partition key.
> The main complication I see with this is that we would have to use the table definition
when calculating hashes so we know what components of the partition keys need to be used for
hash calculation.
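
To make the multi-get concern in the description concrete, here is a small Python sketch comparing how many nodes a lookup of one account's users would touch under the two hashing schemes. Everything here is an assumption for illustration: {{node_of}}, {{NUM_NODES}}, the MD5-modulo placement, and the fixed UUIDs standing in for timeuuid values are hypothetical and are not how Cassandra assigns tokens.

```python
import hashlib
import uuid

NUM_NODES = 8  # hypothetical cluster size


def node_of(*components):
    """Map the hashed key components to a node index (illustrative only)."""
    raw = "\x00".join(str(c) for c in components).encode("utf-8")
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big") % NUM_NODES


# Fixed UUIDs stand in for timeuuid values so the run is repeatable.
account_id = uuid.UUID(int=1)
user_ids = [uuid.UUID(int=i) for i in range(100, 150)]

# Current schema: user is partitioned by id alone, so the multi-get fans out.
nodes_by_id = {node_of(uid) for uid in user_ids}

# Proposed schema: only the accountId prefix of ((accountId), id) feeds the hash.
nodes_by_account = {node_of(account_id) for _ in user_ids}

print("hash on id:", len(nodes_by_id), "node(s)")
print("hash on accountId prefix:", len(nodes_by_account), "node(s)")
```

Note that the second scheme needs to know which components form the hashed prefix, which is the table-definition dependency mentioned above.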

This message was sent by Atlassian JIRA
