cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Artur Siekielski <>
Subject Tag filtering data model
Date Fri, 11 Sep 2015 07:53:17 GMT
I store documents submitted by users, with optional tags (lists of strings):

   user_id uuid,
   date text, // part of partition key, to distribute data better
   doc_id uuid,
   tags list<text>,
   contents text,
   PRIMARY KEY((user_id, date), doc_id)

What is the best way to implement tag filtering? A user can select a 
list of tags and get documents with the tags. I thought about:

1) Full denormalization - include tags in the primary key and insert a 
doc for each subset of specified tags. This will however lead to large 
disk space usage, because there are 2**n subsets (for 10 tags and a 1MB 
doc 1000MB would be written).

2) Secondary index on 'tags' collection, and using queries like:
SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS=? AND 
tags CONTAINS=? ...

Since I will supply partition key value, I assume there will be no 
problems with contacting multiple nodes. But how well will it work for 
hundreds of thousands of results? I think intersection of tag matches 
needs to be performed in memory so it will not scale well.

3) Partial denormalization - do inserts for each single tag and then 
manually compute intersection. However in the worst case it can lead to 
scanning almost the whole table.

4) Full denormalization but without contents. I would get correct 
doc_ids fast, then I would need to use '... WHERE doc_id IN ?' with 
potentially a very large list of doc_ids.

What's Cassandra's way to implement this?

View raw message