On Mon, Feb 24, 2014 at 11:47 AM, Sylvain Lebresne <sylvain@datastax.com> wrote:
 

I still have some questions regarding the mapping. Please bear with me if these are stupid questions. I am quite new to Cassandra.

The basic cassandra data model for a keyspace is something like this, right?

SortedMap<byte[], SortedMap<byte[], Pair<Long, byte[]>>>
          ^ row key. determines which server(s) the rest is stored on
                            ^ column key
                                         ^ timestamp (latest one wins)
                                               ^ value (can be size 0)

It's a reasonable way to think of how things are stored internally, yes. Though, as DuyHai mentioned, the first map is really sorted by token rather than by the raw key, so in practice you mostly rely on the ordering of the second map.
 
Yes, understood.
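
To make the mental model above concrete, here is a minimal in-memory sketch of it in Java. It is only an illustration of the nested-map idea, not how Cassandra is implemented: the names DataModelSketch, Cell, write and read are made up for this example, rows are ordered by raw key bytes rather than by token, and timestamp ties are simply ignored.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration of the nested-map model; not Cassandra code.
class DataModelSketch {
    /** One cell: a write timestamp plus the value bytes (the value may be empty). */
    static final class Cell {
        final long timestamp;
        final byte[] value;

        Cell(long timestamp, byte[] value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    // byte[] has no natural ordering, so compare lexicographically as unsigned bytes (Java 9+).
    static final Comparator<byte[]> BYTES = Arrays::compareUnsigned;

    // Outer map: row key -> columns. Inner map: column key -> (timestamp, value).
    // Assumption: a real cluster orders rows by token, not by the raw key as done here.
    final SortedMap<byte[], SortedMap<byte[], Cell>> rows = new TreeMap<>(BYTES);

    /** Stores the value unless the column already holds a cell with an equal or newer timestamp. */
    void write(byte[] rowKey, byte[] columnKey, long timestamp, byte[] value) {
        SortedMap<byte[], Cell> columns = rows.computeIfAbsent(rowKey, k -> new TreeMap<>(BYTES));
        Cell existing = columns.get(columnKey);
        if (existing == null || timestamp > existing.timestamp) {
            columns.put(columnKey, new Cell(timestamp, value));
        }
    }

    /** Returns the current value of a column, or null if the row or column does not exist. */
    byte[] read(byte[] rowKey, byte[] columnKey) {
        SortedMap<byte[], Cell> columns = rows.get(rowKey);
        if (columns == null) {
            return null;
        }
        Cell cell = columns.get(columnKey);
        return cell == null ? null : cell.value;
    }

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        DataModelSketch store = new DataModelSketch();
        store.write(utf8("row1"), utf8("colA"), 2L, utf8("new"));
        store.write(utf8("row1"), utf8("colA"), 1L, utf8("old")); // older timestamp, ignored
        System.out.println(new String(store.read(utf8("row1"), utf8("colA")), StandardCharsets.UTF_8));
    }
}
```

The write with the older timestamp arrives last but does not replace the stored cell, which is the "latest one wins" rule from the diagram.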

So the first SortedMap is sorted on some kind of hash of the actual key, to make sure the data gets evenly distributed across the nodes? What if my key is already a good hash: is there a way to use an identity function as the hash function (in CQL)?

It's possible, yes. The hash function we're talking about is what Cassandra calls "the partitioner". You configure the partitioner in the yaml config file, and there is one partitioner, ByteOrderedPartitioner, that is basically the identity function.
We however usually discourage users from using it, because the partitioner is global to a cluster and cannot be changed (you basically pick it at cluster creation time and are stuck with it until the end of time), and because ByteOrderedPartitioner can easily lead to hotspots in the data distribution if you're not careful. For those reasons, the default partitioner is also much more tested, and I can't remember anyone mentioning that the partitioner has been a bottleneck.

Thanks for the info. I thought that this might be possible to adjust on a per-keyspace level.

But if you can only do this globally, then I will leave it alone. Other than the (probably negligible) performance impact of hashing the hash again, there is nothing wrong with doing so. Hashing a SHA1-hash will give a good distribution.
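
As a rough way to check the "hashing a hash" claim, the sketch below builds SHA-1 row keys, hashes them a second time, and counts how many land in each of a few equal token ranges. It is an illustration under assumptions: RehashSketch, token and spread are made-up names, and MD5 is used only as a stand-in for a hashing partitioner (which hash the cluster really applies depends on the configured partitioner).

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration; not Cassandra's partitioner code.
class RehashSketch {
    static byte[] digest(String algorithm, byte[] input) {
        try {
            return MessageDigest.getInstance(algorithm).digest(input);
        } catch (NoSuchAlgorithmException e) {
            // MD5 and SHA-1 are required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    /** Stand-in for a hashing partitioner: the MD5 of the row key as a non-negative 128-bit integer. */
    static BigInteger token(byte[] rowKey) {
        return new BigInteger(1, digest("MD5", rowKey));
    }

    /** Counts how many of n SHA-1 row keys fall into each of `buckets` equal token ranges. */
    static int[] spread(int n, int buckets) {
        BigInteger rangeSize = BigInteger.ONE.shiftLeft(128).divide(BigInteger.valueOf(buckets));
        int[] counts = new int[buckets];
        for (int i = 0; i < n; i++) {
            // The application's row key is already a hash (SHA-1 of some made-up content).
            byte[] rowKey = digest("SHA-1", ("object-" + i).getBytes(StandardCharsets.UTF_8));
            int bucket = token(rowKey).divide(rangeSize).intValue();
            // Clamp the rounding remainder at the top of the token space into the last bucket.
            counts[Math.min(bucket, buckets - 1)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int count : spread(10_000, 8)) {
            System.out.println(count);
        }
    }
}
```

With a reasonable hash, each bucket should end up close to n / buckets, which is the sense in which rehashing an already well-distributed key does no harm beyond the extra CPU cost.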

Anyway, this is getting a bit off-topic.

cheers,

Rüdiger