From Christian Decker <>
Subject Join & Range Query performance
Date Sat, 28 Aug 2010 15:45:09 GMT
I'm wondering what the performance considerations are on Join-like queries.

I have a ColumnFamily that holds millions of records (not unusual as I
understand) and I want to work on them using Pig and Hadoop. Until now we
always fetched all rows in Cassandra and just filtered and worked on them.
The idea now is to introduce indices to speed up some of these analyses.
Let's assume we have page hits, each of which has a user associated with it,
and many of our queries work on the users. Creating a ColumnFamily whose key
is the user id would be logical, but that would mean storing all the data
twice (once in the all-encompassing ColumnFamily and once as
SubcolumnFamilies in the index), and since we might add further indices,
each one would multiply our data size again.

Usually in a relational world we'd not save the data in the index, but a
pointer to the real entry. Would it be wise to just store the key of the
item that is referenced and then iteratively fetch them from the cluster?
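
To make the question concrete, here is a rough sketch of the pointer-style
index I have in mind. It is only a model using plain Python dicts, not
Cassandra client code: the names page_hits, build_user_index and
fetch_hits_for_user and the sample rows are all made up for illustration.
The index keeps only row keys per user, and the rows are then fetched one
key at a time from the main store.

```python
# Hypothetical main store: every page hit is stored once, keyed by hit id.
page_hits = {
    "hit1": {"user": "alice", "url": "/home"},
    "hit2": {"user": "bob", "url": "/about"},
    "hit3": {"user": "alice", "url": "/contact"},
}


def build_user_index(hits):
    """Map each user id to the list of hit keys, storing keys only (no data)."""
    index = {}
    for hit_key, row in hits.items():
        index.setdefault(row["user"], []).append(hit_key)
    return index


def fetch_hits_for_user(index, hits, user):
    """Resolve the pointers: one lookup in the main store per indexed key."""
    return [hits[key] for key in index.get(user, [])]


user_index = build_user_index(page_hits)
alice_hits = fetch_hits_for_user(user_index, page_hits, "alice")
```

The index stays small, but every query pays for one extra lookup per
referenced row, which is exactly the cost I am unsure about on a real
cluster.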

Also I'd like to know how key range queries perform compared to simple key
lookups, since I'd like to build a dynamic storage system which splits really
large rows into smaller ones by specifying one more byte of the key (so
from a\0\0\0\0 we might go to a\0\0\0\0 - a\255\0\0\0, and then get all
results by simply querying a\0\0\0\0 through a\255\255\255\255).
I have no idea if this is even possible; I'm just playing around with some ideas.
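
A small sketch of the key-splitting idea, again as a plain Python model
rather than Cassandra code. It assumes the keys are kept in byte order (which,
as far as I understand, would need an order-preserving partitioner on the
cluster); split_key and range_scan are hypothetical helpers, and the sorted
list stands in for the key space.

```python
import bisect


def split_key(prefix: bytes, bucket: int, width: int = 4) -> bytes:
    """Build a sub-row key: prefix, one bucket byte, then zero padding."""
    return prefix + bytes([bucket]) + b"\x00" * (width - 1)


def range_scan(sorted_keys, start: bytes, end: bytes):
    """Return all keys k with start <= k <= end from a byte-ordered list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_right(sorted_keys, end)
    return sorted_keys[lo:hi]


# The large row "a" split into three sub-rows, plus an unrelated row "b".
keys = sorted([split_key(b"a", b) for b in (0, 17, 255)] + [split_key(b"b", 0)])

# Query a\0\0\0\0 through a\255\255\255\255 to collect every piece of "a".
hits = range_scan(keys, b"a" + b"\x00" * 4, b"a" + b"\xff" * 4)
```

In this model the range scan returns the three "a" sub-rows and skips "b";
whether the equivalent range query is cheap enough compared to a single key
lookup is what I would like to find out.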


