I store documents submitted by users, with optional tags (lists of strings):
CREATE TABLE doc (
date text, // part of partition key, to distribute data better
PRIMARY KEY((user_id, date), doc_id)
What is the best way to implement tag filtering? A user can select a list of tags and get documents with the tags. I thought about:
1) Full denormalization - include tags in the primary key and insert a doc for each subset of specified tags. This will however lead to large disk space usage, because there are 2**n subsets (for 10 tags and a 1MB doc 1000MB would be written).
2) Secondary index on 'tags' collection, and using queries like:
SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS=? AND tags CONTAINS=? ...
Since I will supply partition key value, I assume there will be no problems with contacting multiple nodes. But how well will it work for hundreds of thousands of results? I think intersection of tag matches needs to be performed in memory so it will not scale well.
3) Partial denormalization - do inserts for each single tag and then manually compute intersection. However in the worst case it can lead to scanning almost the whole table.
4) Full denormalization but without contents. I would get correct doc_ids fast, then I would need to use '... WHERE doc_id IN ?' with potentially a very large list of doc_ids.
What's Cassandra's way to implement this?