Really interesting question Artur. Have you gone any further?

I think, based on my experience and recalling Cassandra's good practices, that full denormalisation is the Cassandra way to go.

Cheers

Carlos Alonso | Software Engineer | @calonso

On 11 September 2015 at 08:53, Artur Siekielski <artc@vhex.net> wrote:
I store documents submitted by users, with optional tags (lists of strings):

CREATE TABLE doc (
  user_id uuid,
  date text, // part of partition key, to distribute data better
  doc_id uuid,
  tags list<text>,
  contents text,
  PRIMARY KEY((user_id, date), doc_id)
);

What is the best way to implement tag filtering? A user can select a list of tags and get documents with the tags. I thought about:

1) Full denormalization - include tags in the primary key and insert a doc for each subset of specified tags. This will however lead to large disk space usage, because there are 2**n subsets (for 10 tags and a 1MB doc 1000MB would be written).

2) Secondary index on 'tags' collection, and using queries like:
SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS=? AND tags CONTAINS=? ...

Since I will supply partition key value, I assume there will be no problems with contacting multiple nodes. But how well will it work for hundreds of thousands of results? I think intersection of tag matches needs to be performed in memory so it will not scale well.

3) Partial denormalization - do inserts for each single tag and then manually compute intersection. However in the worst case it can lead to scanning almost the whole table.

4) Full denormalization but without contents. I would get correct doc_ids fast, then I would need to use '... WHERE doc_id IN ?' with potentially a very large list of doc_ids.


What's Cassandra's way to implement this?