incubator-cassandra-user mailing list archives

From aaron morton <aa...@thelastpickle.com>
Subject Re: data modeling from batch_mutate point of view
Date Thu, 11 Apr 2013 10:28:56 GMT
> b) the "batch_mutate" advantages are better, for the communication "client<=>coordinator
> node" __and__ for the communications "coordinator node<=>replicas".
Yes. A single row mutation can write to many CFs. 

> Is there any experience out there about such data modeling (option_a vs option_b) from
> the batch_mutate perspective ?
> Thanks.
I would not worry about the internal network lag as much as creating hot rows in the model.
Sometimes it makes sense for an entity to map to rows in several CFs that use the same key,
e.g. user info or a blog post. However it is normally bad when many entities require storing
data on the same row, e.g. all blog posts have to update one row. 

From my understanding of what you are doing I would look to spread out the index entries to
use different row keys. If the indexes are small you may get away with using the same key,
but I would start with spreading it out. 
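
To illustrate the hot-row concern above, here is a minimal Python sketch (not code from this thread). It builds a structure shaped like the Thrift batch_mutate mutation map (row key -> CF name -> list of columns) for both designs, then counts how many column writes land on each row key. The CF names, helper names, and sample values are all hypothetical, and no Cassandra client is involved.

```python
from collections import Counter, defaultdict


def build_mutation_map(folder_id, file_id, timestamp, indexed_values, shared_key):
    """Return {row_key: {cf_name: [column_name, ...]}} for one new file.

    indexed_values maps a (hypothetical) index CF name to the value indexed there.
    shared_key=False models option a, shared_key=True models option b.
    """
    mutation_map = defaultdict(lambda: defaultdict(list))
    for cf_name, value in indexed_values.items():
        if shared_key:
            # Option b: one row key per folder, indexed value moves into the column name.
            row_key = (folder_id,)
            column_name = (value, timestamp, file_id)
        else:
            # Option a: the indexed value is part of the row key.
            row_key = (folder_id, value)
            column_name = (timestamp, file_id)
        mutation_map[row_key][cf_name].append(column_name)
    return mutation_map


def writes_per_row(mutation_maps):
    """Count column writes per row key across many batches."""
    counts = Counter()
    for mutation_map in mutation_maps:
        for row_key, cfs in mutation_map.items():
            counts[row_key] += sum(len(cols) for cols in cfs.values())
    return counts


# Hypothetical sample: 100 files in one folder, two index CFs.
files = [
    (f"file_{i}", 1000 + i, {"idx_status": f"status_{i % 5}", "idx_note": f"note_{i % 4}"})
    for i in range(100)
]
spread = writes_per_row(build_mutation_map("folder_1", f, ts, v, False) for f, ts, v in files)
shared = writes_per_row(build_mutation_map("folder_1", f, ts, v, True) for f, ts, v in files)
```

With this sample, the shared-key design sends every write for the folder to a single row key, while the spread design divides the same writes across one row key per (folder, value) pair.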

Cheers
 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/04/2013, at 2:27 AM, DE VITO Dominique <dominique.devito@thalesgroup.com> wrote:

> Thanks Aaron.
>  
> It helped.
>  
> Let me rephrase my questions a little. They are about the impact of data modeling on the
> advantages of "batch_mutate".
>  
> I have one CF for storing data, and ~10 (all different) CFs used for indexing that data.
>  
> When adding a piece of data, I need to add index entries too, so I need to add columns
> to one row in each of the 10 indexing CFs => 2 main designs are possible for adding these
> new index entries.
>  
> a) all 10 updated rows of the indexing CFs have different rowkeys
> b) all 10 updated rows of the indexing CFs have the same rowkey
>  
> AFAIK, this has an effect on batch_mutate:
>  
> a) the "batch_mutate" advantages stop at the coordinator node. The advantage appears
> for the communication "client<=>coordinator node"
> b) the "batch_mutate" advantages are better, for the communication "client<=>coordinator
> node" __and__ for the communications "coordinator node<=>replicas".
>  
> So, to summarize:
>  
> a) CFs with little repeated data (good), but the coordinator node needs to communicate with
> different replicas according to the different rowkeys
> b) CFs with more denormalization, repeating some data again and again across composite
> columns, but batch_mutate performs better (good) all the way to the replicas, and not only
> up to the coordinator node.
>  
> Each option has one pro and one con.
>  
> Is there any experience out there about such data modeling (option_a vs option_b) from
> the batch_mutate perspective ?
> Thanks.
>  
> Dominique
>  
>  
>  
> From: aaron morton [mailto:aaron@thelastpickle.com]
> Sent: Tuesday, 9 April 2013 10:12
> To: user@cassandra.apache.org
> Subject: Re: data modeling from batch_mutate point of view
>  
> So, one alternative design for indexing CF could be:
> rowkey = folder_id
> colname = (indexed value, timestamp, file_id)
> colvalue = ""
>  
> If you always search in a folder, what about
> rowkey = <folder_id : property_name : property_value>
> colname = <file_id>
>  
> (That's closer to secondary indexes in cassandra with the addition of the folder_id)
>  
> According to pro vs con, is the alternative design more or less interesting ?
> IMHO it's normally better to spread the rows and consider how they grow over time. 
> You can send updates for multiple rows in the same batch mutation. 
>  
> Hope that helps. 
>  
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>  
> @aaronmorton
> http://www.thelastpickle.com
>  
> On 9/04/2013, at 3:57 AM, DE VITO Dominique <dominique.devito@thalesgroup.com> wrote:
> 
> 
> Hi,
>  
> I have a use case that amounts to storing data associated with files. So, I store them
> in a CF like this:
> rowkey = (folder_id, file_id)
> colname = property name (about the file corresponding to file_id)
> colvalue = property value
>  
> And I have CFs for "manual" indexing:
> rowkey = (folder_id, indexed value)
> colname = (timestamp, file_id)
> colvalue = ""
>  
> like
> rowkey = (folder_id, note_of_5) or (folder_id, some_status)
> colname = (some_date, some_filename)
> colvalue = ""
>  
> I have many CFs for indexing, as I index according to different (file) properties.
>  
> So, one alternative design for indexing CF could be:
> rowkey = folder_id
> colname = (indexed value, timestamp, file_id)
> colvalue = ""
>  
> Alternative design:
> * pro: same rowkey for all indexing CFs => **all** indexing CFs could be updated through
> one batch_mutate
> * con: repeating the "indexed value" (1st part of the colname) again and again (= a string
> of up to 20 characters)
>  
> According to pro vs con, is the alternative design more or less interesting ?
>  
> Thanks.
>  
> Dominique
>  
>  
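
The three index layouts discussed in this thread can be compared side by side. The following is a rough Python sketch only: the function names are hypothetical, tuples stand in for composite keys and composite column names, and the ":" separator simply mirrors the notation used in the reply above.

```python
def original_index(folder_id, indexed_value, timestamp, file_id):
    """Original design: rowkey = (folder_id, indexed value), colname = (timestamp, file_id)."""
    return (folder_id, indexed_value), (timestamp, file_id)


def folder_only_index(folder_id, indexed_value, timestamp, file_id):
    """Alternative design: rowkey = folder_id, colname = (indexed value, timestamp, file_id)."""
    return (folder_id,), (indexed_value, timestamp, file_id)


def per_property_index(folder_id, property_name, property_value, file_id):
    """Suggested design: rowkey = <folder_id : property_name : property_value>, colname = <file_id>."""
    return f"{folder_id}:{property_name}:{property_value}", file_id


# Hypothetical sample entry for one file.
row_key, column_name = per_property_index("folder_1", "status", "open", "report.txt")
```

In the first and third layouts the indexed value is part of the row key, so entries for different values land on different rows. In the second layout every entry for a folder shares one row, and the indexed value is repeated in each column name.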

