Thanks Aaron.

 

It helped.

 

Let me rephrase my question a bit. It's about the impact of data modeling on the advantages of "batch_mutate".

 

I have one CF for storing data, and about 10 different CFs used for indexing that data.

 

When adding a piece of data, I also need to add index entries, which means adding columns to one row in each of the 10 indexing CFs. Two main designs are possible for adding these new index entries:

 

a) the 10 updated rows (one per indexing CF) all have different rowkeys

b) the 10 updated rows (one per indexing CF) all have the same rowkey

 

AFAIK, this has an effect on batch_mutate:

 

a) the "batch_mutate" advantage stops at the coordinator node: it only benefits the "client<=>coordinator node" communication.

b) the "batch_mutate" advantage goes further: it benefits the "client<=>coordinator node" communication __and__ the "coordinator node<=>replicas" communication.

 

So, to summarize:

 

a) the CFs repeat little data (good), but the coordinator node has to talk to different replicas, since the rowkeys differ

b) the CFs are more denormalized, repeating some data again and again across composite columns, but batch_mutate performs better (good) all the way to the replicas, not only up to the coordinator node.

 

Each option has one pro and one con.
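
To make the comparison concrete, here is a minimal Python sketch. It is not real client code: it only models the batch_mutate argument as a nested dict (rowkey -> CF name -> list of columns) and counts the distinct rowkeys in each option, as a rough upper bound on the number of replica sets the coordinator may have to contact. The CF names (idx_0 ... idx_9), the key format and the sample values are made up for illustration.

```python
from collections import defaultdict

def build_mutation_map(updates):
    """Group (row_key, cf_name, column_name) updates into the nested shape
    {row_key: {cf_name: [column_name, ...]}}."""
    mutation_map = defaultdict(lambda: defaultdict(list))
    for row_key, cf_name, column_name in updates:
        mutation_map[row_key][cf_name].append(column_name)
    return {key: dict(cfs) for key, cfs in mutation_map.items()}

# Hypothetical index CF names; the thread only says there are about 10.
INDEX_CFS = ["idx_%d" % i for i in range(10)]

def option_a(folder_id, file_id, ts):
    # A different rowkey per index CF: (folder_id, indexed value).
    return [("%s:value_%d" % (folder_id, i), cf, (ts, file_id))
            for i, cf in enumerate(INDEX_CFS)]

def option_b(folder_id, file_id, ts):
    # Same rowkey everywhere; the indexed value moves into the column name.
    return [(folder_id, cf, ("value_%d" % i, ts, file_id))
            for i, cf in enumerate(INDEX_CFS)]

map_a = build_mutation_map(option_a("folder42", "file7", 1365500000))
map_b = build_mutation_map(option_b("folder42", "file7", 1365500000))

print(len(map_a))  # number of distinct rowkeys in option a
print(len(map_b))  # number of distinct rowkeys in option b
```

In both options the client sends a single batch; the difference is only in how many rowkeys the coordinator has to route. Distinct rowkeys can still land on the same replicas, so the count is a worst case, not a measurement.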

 

Does anyone have experience with this kind of data modeling (option_a vs option_b) from the batch_mutate perspective?

Thanks.

 

Dominique

 

 

 

From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Tuesday, 9 April 2013 10:12
To: user@cassandra.apache.org
Subject: Re: data modeling from batch_mutate point of view

 

> So, one alternative design for indexing CF could be:
>
> rowkey = folder_id
>
> colname = (indexed value, timestamp, file_id)
>
> colvalue = ""

 

If you always search within a folder, what about:

rowkey = <folder_id : property_name : property_value>

colname = <file_id>

 

(That's closer to secondary indexes in cassandra with the addition of the folder_id)
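
As a rough illustration of that layout, here is a small Python sketch that models the index CF as an in-memory dict. The ":" separator, the function names and the sample values are assumptions made for the example; a real schema would more likely use a composite key type, since a plain separator is ambiguous if a value itself contains ":".

```python
def index_row_key(folder_id, property_name, property_value):
    """Build a rowkey of the form folder_id:property_name:property_value."""
    return ":".join([folder_id, property_name, str(property_value)])

# In-memory stand-in for the index CF: rowkey -> set of column names (file ids).
index = {}

def add_to_index(folder_id, file_id, properties):
    # One index row per (folder, property name, property value).
    for name, value in properties.items():
        key = index_row_key(folder_id, name, value)
        index.setdefault(key, set()).add(file_id)

def find_files(folder_id, property_name, property_value):
    # A lookup is a single-row read: the columns of that row are the file ids.
    key = index_row_key(folder_id, property_name, property_value)
    return sorted(index.get(key, set()))

add_to_index("folder42", "file7", {"note": 5, "status": "draft"})
add_to_index("folder42", "file9", {"note": 5, "status": "final"})

print(find_files("folder42", "note", 5))
print(find_files("folder42", "status", "final"))
```

Each (folder, property, value) combination gets its own row, so rows stay spread across the ring and each row only grows with the number of matching files.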

 

> Given this pro and con, is the alternative design more or less attractive?

IMHO it's normally better to spread the rows and consider how they grow over time. 

You can send updates for multiple rows in the same batch mutation. 

 

Hope that helps. 

 

-----------------

Aaron Morton

Freelance Cassandra Consultant

New Zealand

 

@aaronmorton

 

On 9/04/2013, at 3:57 AM, DE VITO Dominique <dominique.devito@thalesgroup.com> wrote:



Hi,

 

My use case amounts to storing data associated with files. So, I store it in this CF:

rowkey = (folder_id, file_id)

colname = property name (about the file corresponding to file_id)

colvalue = property value

 

And I have CFs for "manual" indexing:

rowkey = (folder_id, indexed value)

colname = (timestamp, file_id)

colvalue = ""

 

For example:

rowkey = (folder_id, note_of_5) or (folder_id, some_status)

colname = (some_date, some_filename)

colvalue = ""

 

I have many indexing CFs, since I index on several different (file) properties.

 

So, one alternative design for indexing CF could be:

rowkey = folder_id

colname = (indexed value, timestamp, file_id)

colvalue = ""

 

Alternative design:

* pro: same rowkey for all indexing CFs => **all** indexing CFs could be updated through one batch_mutate

* con: the "indexed value" (1st part of the colname) is repeated again and again (a string of up to 20 characters)
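
To show what the alternative design implies for reads and for storage, here is a small Python sketch. It models one wide index row as a sorted list of (indexed value, timestamp, file_id) tuples, standing in for composite column names kept in comparator order. The function names and the sample data are invented for the example.

```python
import bisect

def insert_column(row, indexed_value, timestamp, file_id):
    """Insert a composite column name, keeping the row sorted."""
    bisect.insort(row, (indexed_value, timestamp, file_id))

def slice_by_value(row, indexed_value):
    """Return all columns whose first component equals indexed_value."""
    # (value,) sorts before every (value, ts, file_id) tuple.
    start = bisect.bisect_left(row, (indexed_value,))
    # (value, +inf) sorts after every (value, ts, file_id) tuple.
    end = bisect.bisect_left(row, (indexed_value, float("inf")))
    return row[start:end]

def repeated_prefix_chars(row):
    """Characters spent on the indexed value across all column names."""
    return sum(len(indexed_value) for indexed_value, _, _ in row)

row = []  # one index row, rowkey = folder_id
insert_column(row, "note_5", 3, "file3")
insert_column(row, "note_4", 1, "file1")
insert_column(row, "note_5", 2, "file2")

print(slice_by_value(row, "note_5"))
print(repeated_prefix_chars(row))
```

Finding the files for one indexed value becomes a column slice inside a single row, and the indexed value is stored once per column instead of once per rowkey, which is the con listed above.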

 

Given this pro and con, is the alternative design more or less attractive?

 

Thanks.

 

Dominique