I often suggest people think about using something like JSON for data that looks relatively unchanging, or that is always worked on as a single entity, for a few reasons.
1. Cassandra does not need to know about every atomic piece of data in your model. Obviously there are some good application reasons to store things in columns, such as TTL, slice ranges, etc. Blobbing data was generally a bad thing to do in an RDBMS, but IMHO it's a valid option in Cassandra.
2. For every column value you store in cassandra you also store the column name, timestamp and some other bytes. This is the price you pay for a schema free DB. So there can be an unexpected storage (and network) bloat if you are storing lots of small values in lots of columns. Whether you consider this expensive has to do with how much you like running ALTER TABLE statements.
3. IMHO there is little difference in the code being written to detect that a Cassandra row or a JSON dict does not contain a column because it was created before the last code release. Adding attributes to your entity is still a code-only change, and you only need to update old data if your business problem requires it.
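As a rough illustration of point 3, here is a minimal Python sketch of reading an entity stored as a JSON blob when older rows lack a newer attribute. The field names (sender, rawtext, priority) and the default value are hypothetical, and only the standard json module is used; no Cassandra client is involved.

```python
import json

# Hypothetical defaults for attributes added in later code releases.
DEFAULTS = {"priority": "normal"}

def load_message(raw):
    """Deserialise a JSON blob and fill in attributes that old rows lack."""
    message = dict(DEFAULTS)
    message.update(json.loads(raw))
    return message

# A blob written before the "priority" attribute existed.
old_blob = json.dumps({"sender": "openvictor", "rawtext": "hello"})
# A blob written by the current release.
new_blob = json.dumps({"sender": "aaron", "rawtext": "hi", "priority": "high"})

old_message = load_message(old_blob)
new_message = load_message(new_blob)
```

The old blob is never rewritten; the missing attribute is handled at read time, much as a missing column would be.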
There are also a number of reasons not to do it:
1. It does not pass your smell test.
2. You have multiple agents updating the entity with no locking on writes.
3. You want to pull back parts of the entity, do slices, use TTL, secondary indexes etc etc.
4. You work cross platform, use brisk/hadoop, use hive/pig and it's a pain for everyone.
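Reason 2 can be pictured with a small in-memory sketch. It is not Cassandra code: a dict stands in for a row, and the write function is a simplified last-write-wins rule by timestamp. The column and field names are made up for the example.

```python
import json

def write(row, column, value, timestamp):
    """Simplified last-write-wins per column, keyed on the write timestamp."""
    current = row.get(column)
    if current is None or timestamp >= current[1]:
        row[column] = (value, timestamp)

# Blob approach: both agents read the same blob, change one field, write it back.
blob_row = {}
write(blob_row, "entity", json.dumps({"sender": "a", "status": "new"}), 1)
snapshot = json.loads(blob_row["entity"][0])

agent_one = dict(snapshot, status="read")  # agent one changes status
agent_two = dict(snapshot, sender="b")     # agent two changes sender
write(blob_row, "entity", json.dumps(agent_one), 2)
write(blob_row, "entity", json.dumps(agent_two), 3)
blob_result = json.loads(blob_row["entity"][0])

# Column approach: each agent writes only the column it changed.
column_row = {}
write(column_row, "sender", "a", 1)
write(column_row, "status", "new", 1)
write(column_row, "status", "read", 2)
write(column_row, "sender", "b", 3)
column_result = {name: value for name, (value, _) in column_row.items()}
```

With the blob, agent two's write replaces the whole entity and agent one's status change is lost; with separate columns both changes survive.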
I agree it's not for every situation and it probably makes sense to start coding without it to begin with. But I think it is worth considering in some cases.
Hope that helps.
Freelance Cassandra Developer
On 26 May 2011, at 02:57, openvictor Open wrote:
Sorry I didn't see your message sooner.
So the CF Messages, using UTF8Type, holds information such as who has the right to read, whether it is possible to answer to this list, etc. There are two kinds of keys: those which begin with "message:uuid" and those which begin with "messagelist:uuid". A column of message:uuid is, for example, "sender" or "rawtext". A column of messagelist:uuid is, for example, "creator" or "participants".
MessagesTime (message_time) is the sorting mechanism, meaning that when I query message_time I get messages or messagelists in the order they were sent. There are 2 kinds of keys:
"messagebox:someone": each column has for its Name the uuid of the last message in the corresponding messagelist, and for its Value the uuid of a list inside the messagebox of someone. It gives me a sorting mechanism based on the last message received.
"messagelist:uuid": each column has for its Name the UUID of a message; the Value does not matter.
About your suggestion: it is a very good solution, but there is one thing I don't really like about serialization: it "blocks" evolution. Let's say I would like to add one field to a message; I am obliged to make a tool to deserialize, add the information, reserialize all the fields and insert. Even if I serialize with JSON, it looks like evolution (which is why I chose Cassandra) is a little bit broken. If I am wrong, please tell me so.
However I will explore this very interesting possibility for another project with "tags", which is not really subject to dramatic evolutions.
At the moment I can't really complain about speed, and it is not really time critical (after all, who cares if the messagebox loads in 250 ms instead of 200 ms). I currently get the messages with two batch Cassandra calls, so I think this is satisfactory.
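For readers following along, the two-CF layout described earlier in this message, and the two reads it takes to fetch messages, could be pictured roughly as below. This is only a sketch with plain dicts standing in for the column families: all ids are made up, small integers stand in for time-based message UUIDs so the ordering is easy to see, and the "most recent first" direction is an assumption.

```python
# Row key -> {column name: column value} for each column family.
message = {
    "message:1": {"sender": "alice", "rawtext": "hi"},
    "message:2": {"sender": "bob", "rawtext": "hello back"},
    "message:3": {"sender": "carol", "rawtext": "new thread"},
    "messagelist:l1": {"creator": "alice", "participants": "alice,bob"},
    "messagelist:l2": {"creator": "carol", "participants": "alice,carol"},
}
message_time = {
    # Name: id of the last message in the list. Value: id of the list.
    "messagebox:alice": {2: "l1", 3: "l2"},
    # Name: id of a message. Value: unused.
    "messagelist:l1": {1: "", 2: ""},
    "messagelist:l2": {3: ""},
}

def messages_in_list(list_id):
    """Two reads: column names from message_time, then rows from message."""
    names = sorted(message_time["messagelist:%s" % list_id])
    return [message["message:%s" % name] for name in names]

def lists_in_box(user):
    """List ids in a user's message box, most recent activity first (assumed)."""
    columns = message_time["messagebox:%s" % user]
    return [columns[name] for name in sorted(columns, reverse=True)]
```

Here messages_in_list("l1") corresponds to the two batch calls: one against message_time for the ordered column names, one against message for the contents.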
Thanks again, the json serialization looks like a very interesting possibility.
2011/5/19 aaron morton <email@example.com>
I'm a bit confused by your examples. I think you are saying...
- Standard CF called Message using the UTF8Type for column comparisons used to store the individual messages. Row key is the message UUID. Not sure what the columns are.
- Standard CF called MessageTime using TimeUUIDType for column comparison, used to store collections of messages. Row key is "messagelist:<message_list_uuid>" for a message list, and "messagebox:<user_name>:<mbox_name>" for a message box. Not sure what the columns are.
The best model is going to be the one that supports your read requests and the volume of data you are expecting.
One way to go is to de normalise to support very fast read paths. You could store the entire message in one column using something like JSON to serialise it. Then
- MessageIndexes standard CF to store the full messages in context, there are three different types of rows:
* keys with <user_name> store all messages for a user, column name is the message TimeUUID and value is the message structure
* keys with <user_name>/<mbox_name> store the messages for a single message box. Columns same as above.
* keys with <user_name>/<mbox_name>/<mlist_name> store the messages in a single message list. Columns as above.
- MessageFolders CF to store the message box and message lists, two approaches:
1) <user_name> as key and each column is a message box, message lists are stored in a single column as JSON
2) <user_name> row for the top level message box, column for each message box. <user_name>/<message_box> for the next level,
Or if space is a concern just store the UUID of the message in the index CF and add a CF to store the messages.
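A minimal sketch of the denormalised write path for the MessageIndexes idea above, using a dict in place of the column family. The function names, the "/" key separator as written, and the sample values are illustrative only; uuid.uuid1() stands in for the TimeUUID column name and json.dumps for the serialised message.

```python
import json
import uuid

# Stand-in for the MessageIndexes CF: row key -> {column name: column value}.
message_indexes = {}

def index_keys(user_name, mbox_name, mlist_name):
    """The three row keys a single message is written under."""
    return [
        user_name,
        "%s/%s" % (user_name, mbox_name),
        "%s/%s/%s" % (user_name, mbox_name, mlist_name),
    ]

def store_message(user_name, mbox_name, mlist_name, body):
    """Write the serialised message once per index row (denormalised)."""
    column_name = uuid.uuid1()       # time-based UUID used as the column name
    column_value = json.dumps(body)  # the whole message in one column value
    for key in index_keys(user_name, mbox_name, mlist_name):
        message_indexes.setdefault(key, {})[column_name] = column_value
    return column_name

name = store_message("openvictor", "inbox", "list1",
                     {"sender": "aaron", "rawtext": "hi"})
```

Each read then touches a single row, at the cost of three writes per message and more work if a message is later moved or a box renamed.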
It is also going to depend on the management features, e.g. can you rename a message box / list? Move messages around? If so, the de-normalised pattern may not be the best, as those operations will take longer.
Hope that helps.
Freelance Cassandra Developer
On 19 May 2011, at 05:44, openvictor Open wrote:
> Hello all,
> I know organization is a broad topic and everybody may have an idea of how to do it, but I would really like some advice and opinions, and I think it could be interesting to discuss this matter.
> Here is my problem: I am designing a messaging system internal to a website. There are 3 big structures: Message, MessageList and MessageBox. A message/messagelist is identified only by a UUID; a MessageBox is identified by a name (utf8 string). A messagebox has a set of MessageLists in it and a messagelist has a set of messages in it, all of them being UUIDs.
> Currently I have only two CF : message and message_time. Message is a UTF8Type (cassandra 0.6.11, soon going for 0.8) and message_time is a TimeUUIDType.
> For example, if I want to request all messages in a certain messagelist I do : message_time['messagelist:uuid(messagelist)']
> If I want the information of a message I do message['message:uuid(message)']
> If I want all messagelist for a certain messagebox ( called nameofbox for user openvictor for this example) I do : message_time['messagebox:openvictor:nameofbox']
> My question to Cassandra users is : is it a good idea to group all those things into two CFs ? Are there advantages / drawbacks to these two CFs, and for the long term should I change my organization ?
> Thank you,