incubator-cassandra-user mailing list archives

From Ben Hood <0x6e6...@gmail.com>
Subject Re: Data model for boolean attributes
Date Sat, 22 Mar 2014 03:32:56 GMT
On Sat, Mar 22, 2014 at 1:31 AM, Laing, Michael
<michael.laing@nytimes.com> wrote:
> Whoops now there are only 2 partition keys! Not good if you have any
> reasonable number of rows...

Yes, this column family will have a large number of rows.

> I monitor partition sizes and shard enough to keep them reasonable in this
> sort of situation. The C* infrastructure parallelizes a lot of the activity
> so such queries are quite fast. Oh, and ORDER BY works across shards.

So reading between the lines here, the advantage of this sharded
approach is that you can now order by the id field, because in this
variant, (flag, shard) is the partition key and id becomes a
clustering key.
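To make the sharded variant concrete, here is a minimal application-side sketch in Python. It is only an illustration under assumptions: the shard count, the use of CRC32 to pick a shard, and the helper names are all hypothetical and not taken from Michael's schema. It shows how a row could be assigned to a (flag, shard) partition, and one way an app could merge per-shard results that each arrive sorted by id.

```python
import heapq
import zlib

NUM_SHARDS = 4  # assumption: hypothetical shard count, tune to partition size


def shard_for(row_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Pick a shard from the id.

    crc32 is used because it is stable across processes and restarts,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(row_id.encode("utf-8")) % num_shards


def partition_key(flag: bool, row_id: str) -> tuple:
    """Return the (flag, shard) pair a row would be written under."""
    return (flag, shard_for(row_id))


def merge_shards(per_shard_ids):
    """Merge several id lists, each already sorted, into one sorted list."""
    return list(heapq.merge(*per_shard_ids))
```

With this layout a read for one flag value fans out over NUM_SHARDS partitions, so the shard count is the thing to watch as partitions grow.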

> But the main point is: drive from your queries. Designing for C* is NOT like
> SQL - don't expect to develop a normalized set of tables to do it all. Start
> with how you want to access data and design from there.

That's a good point. Given this, what is the advantage of the sharded
approach over maintaining two separate column families, i.e.

create table x_true (
  id text,
  timestamp timeuuid,
  // other fields
  primary key (id, timestamp)
);

create table x_false (
  id text,
  timestamp timeuuid,
  // other fields
  primary key (id, timestamp)
);

In this scenario, your app reads and writes to/from the appropriate
column family, depending on the value of the flag.
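As a rough sketch of that routing, assuming the two tables above and hypothetical helper names (nothing here is a driver API), the app could pick the table from the flag and build the statement text with bind markers for a prepared statement:

```python
def table_for(flag: bool) -> str:
    """Map the boolean flag to the column family that holds those rows."""
    return "x_true" if flag else "x_false"


def range_query(flag: bool) -> str:
    """Build a CQL range query over one id's timestamp clustering column.

    The '?' markers are placeholders to be bound when the statement
    is prepared and executed.
    """
    return (
        f"SELECT * FROM {table_for(flag)} "
        "WHERE id = ? AND timestamp >= ? AND timestamp <= ?"
    )
```

Changing a row's flag in this layout means deleting it from one table and inserting it into the other, which the app has to do itself.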

At face value, the advantage of the sharded approach is that you can
ORDER BY id. The advantage of the column-family-per-flag approach is
that you don't have to manage and monitor shards.

> So - if you need to get a bunch of ids fast given a flag and maybe an
> id/timestamp range, and your volumes/sizes are such that the number of
> shards can be kept reasonable, this might be a good design, otherwise its
> crap. Drive from your own access patterns to derive your (typically
> denormalized) table defs.

Also a very good point. The main query paths the app needs to support are:

select * from x where flag = true and id = ? and timestamp >= ? and timestamp <= ?;
select * from x where flag = false and id = ? and timestamp >= ? and timestamp <= ?;

In this app, ordering by id is not important.
