Of course what you really want is this:

create table x(
  id text,
  timestamp timeuuid,
  flag boolean,
  // other fields
  primary key (flag, id, timestamp)
)

Whoops now there are only 2 partition keys! Not good if you have any reasonable number of rows...

Faced with a situation like this (although this is extreme) of a limited number of partition keys - and if this access path is important - then you can add shards like this:

create table x(
  id text,
  timestamp timeuuid,
  flag boolean,
  // other fields
  shard int,
  primary key ((flag, shard), id, timestamp)
)

The last part of the partition key may be used with IN so you can query like this:

select * from x where flag=true and shard in (0,1,2,3,4,5,6,...);

I monitor partition sizes and shard enough to keep them reasonable in this sort of situation. The C* infrastructure parallelizes a lot of the activity so such queries are quite fast. Oh, and ORDER BY works across shards.

But the main point is: drive from your queries. Designing for C* is NOT like SQL - don't expect to develop a normalized set of tables to do it all. Start with how you want to access data and design from there.

So - if you need to get a bunch of ids fast given a flag and maybe an id/timestamp range, and your volumes/sizes are such that the number of shards can be kept reasonable, this might be a good design, otherwise its crap. Drive from your own access patterns to derive your (typically denormalized) table defs.

ml



On Fri, Mar 21, 2014 at 3:34 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:
Hello Ben

 Try the following alternative with composite partition key to encode the dual states of the boolean:


create table x(
  id text,
  flag boolean,
  timestamp timeuuid,
    // other fields
  primary key ((id,flag) timestamp)
)

Your previous "select * from x where flag = true;"  translate into:

 SELECT * FROM x WHERE id=... AND flag = true

Of course, you'll need to provide the id in any case.

 If you want to query only on the boolean flag, I'm afraid that manual indexing or secondary index (beware of cardinality !) are your only choices.

Regards

 Duy Hai DOAN




On Fri, Mar 21, 2014 at 8:27 PM, Ben Hood <0x6e6562@gmail.com> wrote:
Hi,

I was wondering what the best way is to lay column families out so
that you can to query by a boolean attribute.

For example:

create table x(
  id text,
  timestamp timeuuid,
  flag boolean,
  // other fields
  primary key (id, timestamp)
)

So that you can query

select * from x where flag = true;

Three approaches spring to mind:

1) Put a secondary index on the flag column
2) Split the column family definition out into two separate CFs, one
for true and one for false
3) Maintain a manual index table
4) Something else

Option (1) seems to be the easiest, but I was wondering if that was
going to put too much load on the secondary index, given the low
cardinality of this attribute. It seems like what Cassandra would have
to do internally to achieve (1) is to effectively implement (2) or (3)
behind the scenes, but I'm just guessing.

Does anybody have any experience with this?

Cheers,

Ben