cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: Recommandation on how to organize CF
Date Sun, 29 May 2011 11:20:57 GMT
I often suggest people think about using something like JSON for data that looks relatively
unchanging, or that is always worked on as a single entity, for a couple of reasons.

1. Cassandra does not need to know about every atomic piece of data in your model. Obviously
there are some good application reasons to store things in columns, such as TTL, slice ranges,
etc. Blobbing data was generally a bad thing to do in an RDBMS, but IMHO it's a valid option
in Cassandra.
2. For every column value you store in Cassandra you also store the column name, timestamp
and some other bytes. This is the price you pay for a schema free DB. So there can be an unexpected
storage (and network) bloat if you are storing lots of small values in lots of columns. Whether
you consider this expensive has to do with how much you like running ALTER TABLE statements.
3. IMHO there is little difference in the code written to detect that a Cassandra row or a
JSON dict does not contain a column because it was created before the last code release. Adding
attributes to your entity is still a code only change and you only need to update old data
if your business problem requires it.
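
To illustrate point 3, here is a minimal sketch in Python. It assumes a message entity stored as one JSON blob in a single column value. The `priority` attribute and the helper names `load_message` and `dump_message` are hypothetical, made up for the example, and no Cassandra client is involved.

```python
import json

# Hypothetical blob written by an older code release: it has no "priority".
OLD_BLOB = json.dumps({"sender": "openvictor", "rawtext": "hello"})

# Attributes added in later releases, with the value to assume when a blob
# written by older code does not contain them (illustrative only).
DEFAULTS = {"priority": "normal"}


def load_message(blob):
    """Deserialize a blob, filling in attributes that old data lacks."""
    message = dict(DEFAULTS)
    message.update(json.loads(blob))
    return message


def dump_message(message):
    """Serialize a message for storage as a single column value."""
    return json.dumps(message, sort_keys=True)


# Old data is readable by new code without rewriting it.
old_message = load_message(OLD_BLOB)
# New data simply carries the extra attribute.
new_message = load_message(
    dump_message({"sender": "aaron", "rawtext": "hi", "priority": "high"})
)
```

Old blobs are only rewritten if and when the application needs to, which is the same situation as a row that lacks a newly introduced column.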

There are also a number of reasons not to do it:

1. It does not pass your smell test.
2. You have multiple agents updating the entity with no look writes.
3. You want to pull back parts of the entity, do slices, use TTL, secondary indexes etc.
4. You work cross platform, use Brisk/Hadoop, use Hive/Pig and it's a pain for everyone.
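
Point 2 can be sketched with a toy model in Python. It assumes a row is a plain dict of column name to (timestamp, value) with last write wins per column, which is only a loose stand-in for how Cassandra reconciles columns. The `read` and `starred` attributes and the helper names are hypothetical.

```python
import json


def write(row, column, value, timestamp):
    """Toy write: keep the value with the highest timestamp per column."""
    current = row.get(column)
    if current is None or timestamp >= current[0]:
        row[column] = (timestamp, value)


def blob_update(row, snapshot, field, value, timestamp):
    """An agent rewrites the whole JSON blob from a snapshot it read earlier."""
    entity = json.loads(snapshot)
    entity[field] = value
    write(row, "blob", json.dumps(entity), timestamp)


# Blob layout: two agents start from the same snapshot and each change one field.
blob_row = {}
write(blob_row, "blob", json.dumps({"read": False, "starred": False}), 1)
snapshot = blob_row["blob"][1]
blob_update(blob_row, snapshot, "read", True, 2)     # agent A
blob_update(blob_row, snapshot, "starred", True, 3)  # agent B, stale snapshot
blob_result = json.loads(blob_row["blob"][1])

# Column layout: each agent writes only the column it cares about.
col_row = {}
write(col_row, "read", False, 1)
write(col_row, "starred", False, 1)
write(col_row, "read", True, 2)     # agent A
write(col_row, "starred", True, 3)  # agent B
col_result = {name: value for name, (ts, value) in col_row.items()}
```

In the blob layout agent B's write replaces the whole entity, so agent A's change to `read` is lost. In the column layout both changes survive because they touch different columns.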

I agree it's not for every situation and it probably makes sense to start coding without it
to begin with. But I think it is worth considering in some cases. 

Hope that helps. 

-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 26 May 2011, at 02:57, openvictor Open wrote:

> Thanks Aaron,
> 
> Sorry I didn't see your message sooner.
> 
> So the CF Messages using UTF8Type holds the  information such as : who has the right
to read/ is it possible to answer to this list etc... There are two "kinds" of keys. The keys
which begin by : "message:uuid" and the "messagelist:uuid". A column of message:uuid is for
example "sender" or "rawtext". A column of messagelist:uuid is for example : "creator" or
"participants".
> 
> 
> MessagesTime (message_time) is the sorting mechanism, meaning when I request against
message_time I get messages or messagelists in the order it was sent. There are 2 kinds of
keys :
> "messagebox:someone" : Each Column is for the Value : the uuid of a list inside the messagebox
of someone, for the Name : the uuid of the last message in the corresponding messagelist.
It gives me a sorting mechanism based on the last message received.
> "messagelist:uuid" : Each Column has for its Name : the UUID of a message and for the
Value : whatever it doesn't really care.
> 
> About your suggestion, it is a very good solution but there is one thing I don't really
like with serialization : it "blocks" evolution. Let's say I would like to add one field to
a message: I am obliged to make a tool to deserialize, add the
information, reserialize all the fields and insert. Even if I serialize with JSON it looks
like evolution (that is why I chose Cassandra) is a little bit broken. If I am wrong, please
tell me so.
> However I will explore this very interesting possibility for another project with "tags",
which is not really subject to dramatic evolutions.
> 
> At the moment I don't really complain about speed and since it is not really time critical
(after all who cares if the messagebox loads in 250 ms instead of 200ms). At the moment I
get the messages with two batch Cassandra calls so I think this is satisfying.
> 
> Thanks again, the json serialization looks like a very interesting possibility.
> 
> Victor
> 
> 2011/5/19 aaron morton <aaron@thelastpickle.com>
> I'm a bit confused by your examples. I think you are saying...
> 
> - Standard CF called Message using the UTF8Type for column comparisons used to store
the individual messages. Row key is the message UUID. Not sure what the columns are.
> - Standard CF called MessageTime using TimeUUIDType for column comparisons, used to store
collections of messages. Row key is "messagelist:<message_list_uuid>" for a message
list, and "messagebox:<user_name>:<mbox_name>" for message box. Not sure what
the columns are.
> 
> The best model is going to be the one that supports your read requests and the volume
of data you are expecting.
> 
> One way to go is to de normalise to support very fast read paths. You could store the
entire message in one column using something like JSON to serialise it. Then
> 
> - MessageIndexes standard CF to store the full messages in context, there are three different
types of rows:
>        * keys with <user_name>  store all messages for a user, column name is the
message TimeUUID and value is the message structure
>        * keys with <user_name>/<mbox_name> store the messages for a single
message box. Columns same as above.
>        * keys with <user_name>/<mbox_name>/<mlist_name> store the messages
in a single message list. Columns as above.
> 
> - MessageFolders CF to store the message box and message lists, two approaches:
>        1) <user_name> as key and each column is a message box, message lists are
stored in a single column as JSON
>        2) <user_name> row for the top level message box, column for each message
box. <user_name>/<message_box> for the next level,
> 
> Or if space is a concern just store the UUID of the message in the index CF and add a
CF to store the messages.
> 
> It is also going to depend on the management features, e.g. can you rename a message box
/ list ? Move messages around ? If so the de normalised pattern may not be the best as those
operations will take longer.
> 
> Hope that helps.
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 19 May 2011, at 05:44, openvictor Open wrote:
> 
> > Hello all,
> >
> > I know organization is a broad topic and everybody may have an idea on how to do
it, but I really want to have some advice and opinions and I think it could be interesting
to discuss this matter.
> >
> > Here is my problem: I am designing a messaging system internal to a website. There
are 3 big structures which are Message, MessageList, MessageBox. A message/messagelist is
identified only by an UUID; a MessageBox is identified by a name(utf8 string). A messagebox
has a set of MessageList in it and a messagelist has a set of message in it, all of them being
UUIDs.
> > Currently I have only two CF : message and message_time. Message is a UTF8Type (cassandra
0.6.11, soon going for 0.8) and message_time is a TimeUUIDType.
> >
> > For example if I want to request all message in a certain messagelist I do : message_time['messagelist:uuid(messagelist)']
> > If I want information of a message I do message['message:uuid(message)']
> > If I want all messagelist for a certain messagebox ( called nameofbox for user openvictor
for this example) I do : message_time['messagebox:openvictor:nameofbox']
> >
> > My question to Cassandra users is : is it a good idea to regroup all those things
into two CF ? Is there some advantages / drawbacks of this two CFs and for long term should
I change my organization ?
> >
> > Thank you,
> > Victor
> 
> 

