incubator-cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Laing, Michael" <michael.la...@nytimes.com>
Subject Re: Data model for boolean attributes
Date Sat, 22 Mar 2014 01:31:34 GMT
Of course what you really want is this:

create table x(
  id text,
  timestamp timeuuid,
  flag boolean,
  // other fields
  primary key (flag, id, timestamp)
)

Whoops now there are only 2 partition keys! Not good if you have any
reasonable number of rows...

Faced with a situation like this (although this is extreme) of a limited
number of partition keys - and if this access path is important - then you
can add shards like this:

create table x(
  id text,
  timestamp timeuuid,
  flag boolean,
  // other fields
  shard int,
  primary key ((flag, shard), id, timestamp)
)

The last part of the partition key may be used with IN so you can query
like this:

select * from x where flag=true and shard in (0,1,2,3,4,5,6,...);

I monitor partition sizes and shard enough to keep them reasonable in this
sort of situation. The C* infrastructure parallelizes a lot of the activity
so such queries are quite fast. Oh, and ORDER BY works across shards.

But the main point is: drive from your queries. Designing for C* is NOT
like SQL - don't expect to develop a normalized set of tables to do it all.
Start with how you want to access data and design from there.

So - if you need to get a bunch of ids fast given a flag and maybe an
id/timestamp range, and your volumes/sizes are such that the number of
shards can be kept reasonable, this might be a good design, otherwise its
crap. Drive from your own access patterns to derive your (typically
denormalized) table defs.

ml



On Fri, Mar 21, 2014 at 3:34 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:

> Hello Ben
>
>  Try the following alternative with composite partition key to encode the
> dual states of the boolean:
>
>
> create table x(
>   id text,
>   flag boolean,
>   timestamp timeuuid,
>     // other fields
>   primary key (*(id,flag)* timestamp)
> )
>
> Your previous "select * from x where flag = true;"  translate into:
>
>  SELECT * FROM x WHERE id=... AND flag = true
>
> Of course, you'll need to provide the id in any case.
>
>  If you want to query only on the boolean flag, I'm afraid that manual
> indexing or secondary index (beware of cardinality !) are your only choices.
>
> Regards
>
>  Duy Hai DOAN
>
>
>
>
> On Fri, Mar 21, 2014 at 8:27 PM, Ben Hood <0x6e6562@gmail.com> wrote:
>
>> Hi,
>>
>> I was wondering what the best way is to lay column families out so
>> that you can to query by a boolean attribute.
>>
>> For example:
>>
>> create table x(
>>   id text,
>>   timestamp timeuuid,
>>   flag boolean,
>>   // other fields
>>   primary key (id, timestamp)
>> )
>>
>> So that you can query
>>
>> select * from x where flag = true;
>>
>> Three approaches spring to mind:
>>
>> 1) Put a secondary index on the flag column
>> 2) Split the column family definition out into two separate CFs, one
>> for true and one for false
>> 3) Maintain a manual index table
>> 4) Something else
>>
>> Option (1) seems to be the easiest, but I was wondering if that was
>> going to put too much load on the secondary index, given the low
>> cardinality of this attribute. It seems like what Cassandra would have
>> to do internally to achieve (1) is to effectively implement (2) or (3)
>> behind the scenes, but I'm just guessing.
>>
>> Does anybody have any experience with this?
>>
>> Cheers,
>>
>> Ben
>>
>
>

Mime
View raw message