I guess the subject speaks for itself.
I'm currently developing a document analysis engine that uses Cassandra as its scalable storage layer.
I'd like to give a brief overview of the data model I'm using for this purpose.
The document key has the form timestamp.random(), so the keys sort in chronological order.
This gives me range queries on timestamps out of the box.
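A minimal sketch of how such a key could be built, not my actual code. The function name make_document_key, the millisecond resolution and the padding widths are assumptions made for illustration. Zero-padding both parts keeps the lexicographic order of the keys equal to their chronological order.

```python
import random
import time


def make_document_key(now=None, rng=random):
    """Return a key of the form '<timestamp>.<random>'.

    The timestamp is milliseconds since the epoch, zero-padded to 13 digits
    (an assumed width) so that string comparison matches time order.
    The random suffix only disambiguates keys created in the same millisecond.
    """
    seconds = time.time() if now is None else now
    ts_ms = int(seconds * 1000)
    suffix = rng.randrange(10**6)  # hypothetical 6-digit random part
    return f"{ts_ms:013d}.{suffix:06d}"
```

Two keys created at different times then compare the same way as their timestamps, whatever the random suffix is.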
But I still need to index some values:
I started testing with three types of fields in the Document ColumnFamily:
- fields containing text (several words): every word is an index term
- fields containing positive integers: the zero-padded integer is the index term
- fields containing an enumeration: the value itself is the index term
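The three rules above can be sketched roughly as follows. The function name index_terms, the field type labels, the whitespace tokenizer, the lowercasing and the padding width of 10 are all assumptions for illustration; a real analyzer would be more careful with text.

```python
def index_terms(field_type, value, width=10):
    """Return the index terms for one field value.

    field_type is one of 'text', 'integer' or 'enum' (hypothetical labels).
    """
    if field_type == "text":
        # Every distinct word is a term; lowercasing is an assumption.
        return sorted(set(value.lower().split()))
    if field_type == "integer":
        if value < 0:
            raise ValueError("only non-negative integers are supported")
        # Zero padding makes string order agree with numeric order.
        return [str(value).zfill(width)]
    if field_type == "enum":
        # The value itself is the only term.
        return [value]
    raise ValueError(f"unknown field type: {field_type}")
```

With padding, "0000000009" sorts before "0000000010", which is what makes range queries on integer terms possible.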
For indexing I use another ColumnFamily called IndexCF. Its row key has the form "field_name||index_term", and its values are references to the keys in the Documents ColumnFamily.
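To make the layout concrete, here is an in-memory sketch that models IndexCF as a dictionary of rows. It does not talk to Cassandra at all. Storing each document key as a column name with an empty value is an assumption, and the helper names index_key, add_to_index and lookup are hypothetical.

```python
from collections import defaultdict

# Row key -> {column name: column value}; a stand-in for the IndexCF rows.
index_cf = defaultdict(dict)


def index_key(field_name, term):
    """Build the row key in the 'field_name||index_term' format."""
    return f"{field_name}||{term}"


def add_to_index(field_name, term, doc_key):
    """Add one column to the index row that points at the document."""
    index_cf[index_key(field_name, term)][doc_key] = b""


def lookup(field_name, term):
    """Return the referenced document keys in sorted (chronological) order."""
    return sorted(index_cf.get(index_key(field_name, term), {}))
```

Each indexed document adds exactly one column to the row of every term it contains, so a row grows with the number of documents that share the term.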
After searching for projects related to indexing in Cassandra, I came across Lucandra (http://github.com/tjake/Lucandra).
I've been running tests with it since then to index those types of columns; it basically uses a similar approach.
Lucandra works fine for indexing columns that contain text values and zero-padded integers, and range queries on integers work fine too.
However, the enumeration indexing is a really big problem.
Say we have 1M documents with a type field that can take 4 values (book, magazine, newspaper, other). Assuming the values are distributed equally, each "field_name||index_term" pair would have 250K related documents. Indexing this distribution leaves us with only 4 index keys, each containing 250K columns. That basically means it's not reasonable to index and search on enumeration fields this way.
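The arithmetic can be checked with a small simulation. It assumes a perfectly even round-robin distribution of the four values; the function name row_widths is hypothetical.

```python
from collections import Counter
from itertools import cycle, islice

TYPE_VALUES = ["book", "magazine", "newspaper", "other"]


def row_widths(num_docs, values=TYPE_VALUES, field_name="type"):
    """Count how many columns each 'field_name||index_term' row would hold
    if num_docs documents were spread evenly over the given values."""
    counts = Counter(islice(cycle(values), num_docs))
    return {f"{field_name}||{value}": n for value, n in counts.items()}


widths = row_widths(1_000_000)
for key, n in sorted(widths.items()):
    print(key, n)
```

The number of rows stays fixed at the number of enumeration values, while the width of each row grows linearly with the number of documents.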
I wrote all this in a hurry; I hope I managed to express what I'm opening for discussion. Can you think of a better implementation for indexing enumerations in Cassandra?