incubator-cassandra-user mailing list archives

From DE VITO Dominique <dominique.dev...@thalesgroup.com>
Subject RE: data modeling from batch_mutate point of view
Date Tue, 09 Apr 2013 14:27:03 GMT
Thanks Aaron.

It helped.

Let me rephrase my questions a little. They are about the impact of data modeling on the
advantages of "batch_mutate".

I have one CF for storing data, and ~10 (all different) CFs used for indexing that data.

When adding a piece of data, I need to add index entries too, which means adding columns to
one row in each of the 10 indexing CFs => 2 main designs are possible for adding these
new index entries.

a) the 10 updated rows of the indexing CFs all have different rowkeys
b) the 10 updated rows of the indexing CFs all have the same rowkey

AFAIK, this has an effect on batch_mutate:

a) the "batch_mutate" advantages stop at the coordinator node: the benefit applies only to the
"client<=>coordinator node" communication
b) the "batch_mutate" advantages go further: they apply to the "client<=>coordinator
node" communication __and__ to the "coordinator node<=>replicas" communications.
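
To make the difference between a) and b) concrete, here is a minimal Python sketch of the
nested map that batch_mutate takes (rowkey -> CF -> list of mutations). It uses plain dicts
only and does not talk to Cassandra. The CF names (idx_0 ... idx_9), the folder and file ids
and the helper build_mutation_map are hypothetical, and it assumes all the indexing CFs live
in the same keyspace, so that equal rowkeys map to the same replicas.

```python
from collections import defaultdict


def build_mutation_map(updates):
    """Group (row_key, column_family, column_name) updates by row key.

    The result mirrors the shape of a batch_mutate mutation map:
    {row_key: {column_family: [column_name, ...]}}.
    """
    mutation_map = defaultdict(lambda: defaultdict(list))
    for row_key, column_family, column_name in updates:
        mutation_map[row_key][column_family].append(column_name)
    return {key: dict(cfs) for key, cfs in mutation_map.items()}


# Hypothetical names for the ~10 indexing CFs.
INDEX_CFS = ["idx_%d" % i for i in range(10)]

# Option a: every indexing CF uses its own row key (folder_id + indexed value).
updates_a = [
    ("folder42:value%d" % i, cf, ("2013-04-09", "file7"))
    for i, cf in enumerate(INDEX_CFS)
]

# Option b: every indexing CF uses the same row key (folder_id only);
# the indexed value moves into the composite column name.
updates_b = [
    ("folder42", cf, ("value%d" % i, "2013-04-09", "file7"))
    for i, cf in enumerate(INDEX_CFS)
]

option_a = build_mutation_map(updates_a)
option_b = build_mutation_map(updates_b)

# Number of distinct row keys the coordinator has to route.
print(len(option_a), len(option_b))
```

Under these assumptions, option a) puts 10 distinct rowkeys in the batch, which may map to up
to 10 different replica sets, while option b) puts a single rowkey in the batch, so the
coordinator only has one replica set to contact.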

So, to summarize:

a) CFs with little repeated data (good), but the coordinator node needs to communicate with
different replicas according to the different rowkeys
b) CFs with more denormalization, repeating some data again and again across composite columns,
but batch_mutate performs better (good) all the way to the replicas, not only up to the coordinator node.

Each option has one pro and one con.

Does anyone have experience with this kind of data modeling (option_a vs option_b) from the
batch_mutate perspective?
Thanks.

Dominique



From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Tuesday, 9 April 2013 10:12
To: user@cassandra.apache.org
Subject: Re: data modeling from batch_mutate point of view

> So, one alternative design for indexing CF could be:
> rowkey = folder_id
> colname = (indexed value, timestamp, file_id)
> colvalue = ""

If you always search in a folder what about
rowkey = <folder_id : property_name : property_value>
colname = <file_id>

(That's closer to secondary indexes in cassandra with the addition of the folder_id)
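
A minimal Python sketch of this suggested layout, using a plain dict as a stand-in for the
index CF. The ":" separator, the helper names and the sample ids are assumptions made for
illustration only. A real key would need a separator (or a composite type) that cannot
appear inside the values.

```python
# In-memory stand-in for the index CF: {row_key: {file_id: ""}}.
index_cf = {}


def index_row_key(folder_id, property_name, property_value):
    """Build the row key <folder_id : property_name : property_value>."""
    return ":".join((str(folder_id), str(property_name), str(property_value)))


def index_file(folder_id, file_id, properties):
    """Add one column (colname = file_id, empty value) per indexed property."""
    for name, value in properties.items():
        row = index_cf.setdefault(index_row_key(folder_id, name, value), {})
        row[file_id] = ""


def find_files(folder_id, property_name, property_value):
    """Return the file ids stored under one (folder, property, value) row."""
    row = index_cf.get(index_row_key(folder_id, property_name, property_value), {})
    return sorted(row)


index_file("folder42", "file7", {"status": "done", "note": 5})
index_file("folder42", "file8", {"status": "done"})
print(find_files("folder42", "status", "done"))
```

Each (folder, property, value) combination gets its own row, so rows are spread across the
cluster and each one only grows with the number of files that share that value.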

> According to pro vs con, is the alternative design more or less interesting ?
IMHO it's normally better to spread the rows and consider how they grow over time.
You can send updates for multiple rows in the same batch mutation.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 9/04/2013, at 3:57 AM, DE VITO Dominique <dominique.devito@thalesgroup.com> wrote:


Hi,

I have a use case that amounts to storing data associated with files. So, I store the data in
a CF laid out as:
rowkey = (folder_id, file_id)
colname = property name (about the file corresponding to file_id)
colvalue = property value

And I have CFs for "manual" indexing:
rowkey = (folder_id, indexed value)
colname = (timestamp, file_id)
colvalue = ""

for example:
rowkey = (folder_id, note_of_5) or (folder_id, some_status)
colname = (some_date, some_filename)
colvalue = ""
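
As an illustration of why the timestamp comes first in the column name, here is a small Python
sketch that keeps the columns of each index row sorted, the way columns are kept in comparator
order within a row, and slices them by time range. It is an in-memory model only. The helper
names and the sample values are hypothetical, and timestamps are assumed to be ISO date
strings so that they sort chronologically.

```python
import bisect

# In-memory stand-in for one indexing CF:
# {(folder_id, indexed_value): sorted list of (timestamp, file_id) column names}
index_cf = {}


def add_index_entry(folder_id, indexed_value, timestamp, file_id):
    """Insert the column (timestamp, file_id), keeping the row sorted."""
    columns = index_cf.setdefault((folder_id, indexed_value), [])
    entry = (timestamp, file_id)
    pos = bisect.bisect_left(columns, entry)
    if pos == len(columns) or columns[pos] != entry:
        columns.insert(pos, entry)


def slice_by_time(folder_id, indexed_value, start, end):
    """Return file ids whose timestamp is in [start, end), oldest first."""
    columns = index_cf.get((folder_id, indexed_value), [])
    # A 1-tuple sorts before any 2-tuple sharing the same first element,
    # so these bounds select timestamps >= start and < end.
    low = bisect.bisect_left(columns, (start,))
    high = bisect.bisect_left(columns, (end,))
    return [file_id for _, file_id in columns[low:high]]


add_index_entry("folder42", "some_status", "2013-04-09", "c.txt")
add_index_entry("folder42", "some_status", "2013-04-01", "a.txt")
add_index_entry("folder42", "some_status", "2013-04-05", "b.txt")
print(slice_by_time("folder42", "some_status", "2013-04-01", "2013-04-09"))
```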

I have many CFs for indexing, as I index according to different (file) properties.

So, one alternative design for indexing CF could be:
rowkey = folder_id
colname = (indexed value, timestamp, file_id)
colvalue = ""

Alternative design:
* pro: same rowkey for all indexing CFs => **all** indexing CFs could be updated through
one batch_mutate
* con: repeating the "indexed value" (1st part of the colname) again and again (= a string of up to 20 characters)
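
To get a rough feel for the "con", the Python sketch below adds up the raw bytes spent
repeating the indexed value in every column name. It is only a back-of-the-envelope estimate:
it ignores Cassandra's own per-column and composite encoding overhead and any compression,
and the sample data (1000 files sharing one 18-character value) is made up.

```python
def repeated_value_bytes(indexed_value_by_file):
    """Raw UTF-8 bytes used when each column name repeats its indexed value."""
    return sum(len(value.encode("utf-8")) for value in indexed_value_by_file.values())


# Hypothetical sample: 1000 files in one folder, all with the same indexed value.
sample = {"file%d" % i: "status_in_progress" for i in range(1000)}

alternative_design = repeated_value_bytes(sample)
# In the original design the value appears once, in the row key.
original_design = len("status_in_progress".encode("utf-8"))

print(alternative_design, original_design)
```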

According to pro vs con, is the alternative design more or less interesting ?

Thanks.

Dominique



