cassandra-user mailing list archives

From alexis coudeyras <acoudey...@gmail.com>
Subject Re: Data Modeling
Date Mon, 20 Feb 2012 21:17:44 GMT
Thanks a lot Aaron,

I will try your idea tomorrow.

For the PropertyValues CF, instead of <property_value:customer_id>, should I use <customer_id:property_value>
to preserve the same order for each property_value? (There will be a custom null value.)
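
To make the ordering question concrete, here is a small Python sketch. It uses plain tuples as stand-ins for composite column names, not real Cassandra or Hector calls, and it assumes composite columns sort component by component the way Python tuples do. The shortened ids, the sample values and the NULL sentinel are made up for illustration.

```python
# Plain-Python stand-in for PropertyValues rows. Tuples sort component by
# component, which is ASSUMED here to mimic a composite column comparator.
NULL = ""  # hypothetical custom null value

# Made-up sample data with shortened ids.
customers = {
    "21EC2020": {"country": "EN", "lastName": "Doe"},
    "3F2504E0": {"country": "FR", "lastName": "Adams"},
    "9A0C0305": {"country": "US"},  # no lastName: stored as NULL
}

def row(field, order):
    """Return the sorted column names of the row for one field."""
    cols = []
    for cid, props in customers.items():
        value = props.get(field, NULL)
        cols.append((value, cid) if order == "value_first" else (cid, value))
    return sorted(cols)

# <property_value:customer_id>: each row is sorted by its own values,
# so the nth column of two rows can belong to different customers.
by_value_country = [cid for _value, cid in row("country", "value_first")]
by_value_last = [cid for _value, cid in row("lastName", "value_first")]

# <customer_id:property_value>: every row is sorted by customer id,
# so the nth column of every row belongs to the same customer.
by_id_country = [cid for cid, _value in row("country", "id_first")]
by_id_last = [cid for cid, _value in row("lastName", "id_first")]
```

In this sketch the id-first layout keeps rows aligned only because every customer gets a column in every row, which is where the custom null value comes in.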

Why is using only column names faster? It seems that it's not possible to retrieve column
names without column values in Hector, for example, so even after reading your article (great,
by the way), I don't get it.


On 20 Feb 2012, at 20:41, aaron morton wrote:

> If you want to read all possible values for a field, where the field has 1 million possible
> values, it's going to take time, no matter what data model you use.
> 
> That said, the first model I would use is:
> 
> CF: Customer
> Use this as a canonical record of the properties a customer has. 
> row_key : <customer_id>
> cols: <property_name> = <property_value>
> 
> CF: PropertyValues
> Use this to build the reverse index. Column names are a composite value of
> property value and customer ID.
> row_key: <property_name>
> cols: <property_value:customer_id> = EMPTY
> 
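
As a rough illustration of the two column families described above, the following Python sketch models them as nested dictionaries. It is an in-memory stand-in, not a Cassandra client; the function names `insert_customer` and `all_values` are hypothetical, and sorting tuples is only assumed to approximate how composite column names are ordered.

```python
# In-memory stand-in for the two column families (not a Cassandra client).
EMPTY = b""

customer_cf = {}         # row_key: customer_id -> {property_name: property_value}
property_values_cf = {}  # row_key: property_name -> {(property_value, customer_id): EMPTY}

def insert_customer(customer_id, properties):
    """Write the canonical record and one reverse-index column per property."""
    customer_cf[customer_id] = dict(properties)
    for name, value in properties.items():
        property_values_cf.setdefault(name, {})[(value, customer_id)] = EMPTY

def all_values(property_name):
    """Read one PropertyValues row; the values live in the column names."""
    columns = sorted(property_values_cf.get(property_name, {}))
    return [value for value, _customer_id in columns]

insert_customer("3F2504E0-4F89-11D3-9A0C-0305E82C3301",
                {"firstName": "Carl", "lastName": "Smith", "country": "FR"})
insert_customer("21EC2020-3AEA-1069-A2DD-08002B30309D",
                {"firstName": "John", "lastName": "Doe", "country": "EN"})

countries = all_values("country")
last_names = all_values("lastName")
```

Because the customer id is part of the column name, two customers with the same value for a property get two distinct columns in the row.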
> * To Insert: It is good if you can work out the delta. Just update what you need to in
> the customer, delete the old values from the PropertyValues CF and insert the new ones. Note:
> I would insert when you get the new data, 
> 
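
Working out the delta could look roughly like the Python sketch below. It assumes the previous state of the customer can be read from the Customer CF and compared with the incoming entity; `index_delta` is a hypothetical helper, and the shortened id and the `city` property are made up for the example.

```python
def index_delta(old, new, customer_id):
    """Return (columns_to_delete, columns_to_insert) for PropertyValues.

    Each entry is (row_key, column_name), where the column name is the
    composite (property_value, customer_id).
    """
    deletes, inserts = [], []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before == after:
            continue  # unchanged property: nothing to write
        if before is not None:
            deletes.append((name, (before, customer_id)))
        if after is not None:
            inserts.append((name, (after, customer_id)))
    return deletes, inserts

# Hypothetical update: the country changes and a new property appears.
old = {"firstName": "John", "lastName": "Doe", "country": "EN"}
new = {"firstName": "John", "lastName": "Doe", "country": "FR", "city": "Paris"}
deletes, inserts = index_delta(old, new, "21EC2020")
```

With roughly 300 properties per customer and only a few changing per update, this keeps the number of PropertyValues writes proportional to what actually changed.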
> * To Read:
>>   - I need to retrieve all values of a field (all firstNames, all lastNames,
> Get all the values from the appropriate row.
>> 	- The faster the better (1 to 3 seconds)
> Things take time: http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
>> 	- It must preserve order: if I retrieve all countries and then all
>> lastNames, the nth country and the nth lastName should correspond to the same
>> customer.
> This can only be guaranteed if every customer has a value for every field, or if you use a
> custom null value.
>> 	- Sometimes I will have to retrieve all values of multiple fields (< 10)
> There is no provision for server-side joins. If you have a query you use often, it is
> best to materialise the result.
> 
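
One possible reading of "materialise the result" for a frequently used multi-field query is sketched below in Python. It precomputes a single row keyed by customer id that holds the requested fields together, padding missing values with a sentinel. The `materialise` helper, the NULL sentinel and the shortened ids are assumptions made for illustration, not something prescribed in this thread.

```python
NULL = ""  # hypothetical custom null value for missing properties

def materialise(customers, fields):
    """Build one precomputed row: customer_id -> tuple of the requested fields.

    Customers are visited in id order, and missing properties are padded
    with NULL so every tuple has the same shape.
    """
    return {
        cid: tuple(props.get(field, NULL) for field in fields)
        for cid, props in sorted(customers.items())
    }

# Made-up sample data with shortened ids.
customers = {
    "3F2504E0": {"country": "FR", "lastName": "Smith"},
    "21EC2020": {"country": "EN"},  # no lastName
}
view = materialise(customers, ["country", "lastName"])
```

Reading the combined row then returns the fields already paired per customer, so no join is needed at query time.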
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/02/2012, at 11:49 PM, acoudeyras wrote:
> 
>> Hi,
>> 
>> I'm new to Cassandra and I'm looking for the best way to handle my use case.
>> 
>> My entities look like :
>> 
>> customers : [{
>> 	id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
>> 	firstName: "Carl",
>> 	lastName: "Smith",
>> 	country:"FR"
>> },{
>> 	id:21EC2020-3AEA-1069-A2DD-08002B30309D,
>> 	firstName: "John",
>> 	lastName: "Doe",
>> 	country:"EN"
>> }]
>> 
>> I will use the term "field" to describe a property of customer (lastName for
>> example).
>> 
>> I will have 1 million customers and more than 300 fields (firstName,
>> lastName, ...) for each customer.
>> 
>> I have two requirements :
>> 
>> - I need to retrieve all values of a field (all firstNames, all lastNames,
>> ...).
>> 	- The faster the better (1 to 3 seconds)
>> 	- It must preserve order: if I retrieve all countries and then all
>> lastNames, the nth country and the nth lastName should correspond to the same
>> customer.
>> 	- Sometimes I will have to retrieve all values of multiple fields (< 10)
>> 
>> - Data will be updated (insert, delete, update) every 10 or 20 minutes in
>> bulk; only a small number of entities will change each time. When an update
>> occurs, I have the whole entity as input (a full customer with all its
>> fields). Performance is important, but less so than in the previous case (10
>> seconds for updating is OK).
>> 
>> - Retrieving a customer by id or retrieving a list of customers matching some
>> specific criteria is *not* a requirement.
>> 
>> ---
>> Solution 1:
>> 
>> Column Family : customers
>> One row for each customer : 1 million rows
>> One column for each field : 300 fields by row.
>> 
>> Benefits : easy to update
>> Problem : As far as I understand, it doesn't seem to fit the Cassandra
>> model; getting all values will be slow.
>> 
>> ---
>> Solution 2:
>> 
>> Wide Row for the whole entity
>> 
>> Column Family : datas
>> One row : customers
>> Composite Columns : (fieldName, ID) = fieldValue
>> 
>> Customers : [{
>> 	("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
>> 	("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
>> 	("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
>> 	("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
>> 	("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
>> 	("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
>> ...
>> }]
>> 
>> 
>> As far as I understand, it seems to be the fastest way to retrieve all values
>> of a field in the same order.
>> To update, I don't need to read before writing.
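
The read path of Solution 2 can be sketched in Python as a range slice over sorted composite column names. This models the row as a sorted list and uses `bisect` to find the slice bounds; it is not a Cassandra or Hector call, and tuple sorting is only assumed to approximate the composite comparator. `slice_field` is a hypothetical helper.

```python
from bisect import bisect_left

# The "customers" row from Solution 2 as a sorted list of
# ((fieldName, id), fieldValue) columns.
CARL = "3F2504E0-4F89-11D3-9A0C-0305E82C3301"
JOHN = "21EC2020-3AEA-1069-A2DD-08002B30309D"
columns = sorted([
    (("country", CARL), "FR"), (("country", JOHN), "EN"),
    (("firstName", CARL), "Carl"), (("firstName", JOHN), "John"),
    (("lastName", CARL), "Smith"), (("lastName", JOHN), "Doe"),
])

def slice_field(columns, field):
    """Return the values of one field, in customer-id order."""
    names = [name for name, _value in columns]
    start = bisect_left(names, (field,))         # first column of the field
    end = bisect_left(names, (field + "\x00",))  # just past its last column
    return [value for _name, value in columns[start:end]]

countries = slice_field(columns, "country")
first_names = slice_field(columns, "firstName")
last_names = slice_field(columns, "lastName")
```

Since the id is the second component, every field slice comes back in the same customer order, provided each customer has a column for each field.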
>> 
>> Problem : the row will be very large: 300,000,000 columns. I can split
>> it into different rows based on the value of a specific field, for example
>> country.
>> 
>> ---
>> Solution 3:
>> 
>> Wide Row by field 
>> 
>> Column Family : customers
>> One row by field : so 300 rows
>> Columns : ID = FieldValue
>> 
>> Benefits :
>> The row will be smaller: 1,000,000 columns.
>> 
>> Problem :
>> Updates seem more expensive: for every customer to update, I need to update
>> 300 rows.
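
To see what a bulk update touches in the Solution 3 layout, the Python sketch below groups the changed customers into one set of column writes per field row. It only counts and organises writes; `mutations_by_row` is a hypothetical helper, the shortened ids are made up, and whether this is actually expensive depends on the client and the cluster.

```python
def mutations_by_row(updated_customers):
    """Group a bulk update into per-field-row column writes (Solution 3 layout).

    Returns {field_row_key: {customer_id: value}}.
    """
    batch = {}
    for cid, props in updated_customers.items():
        for field, value in props.items():
            batch.setdefault(field, {})[cid] = value
    return batch

# Made-up bulk update of two customers with three fields each.
updated = {
    "3F2504E0": {"firstName": "Carl", "lastName": "Smith", "country": "FR"},
    "21EC2020": {"firstName": "John", "lastName": "Doe", "country": "EN"},
}
batch = mutations_by_row(updated)
rows_touched = len(batch)
columns_written = sum(len(cols) for cols in batch.values())
```

With full entities as input, each updated customer contributes one column to every field row, so the number of rows touched is bounded by the number of fields while the column count grows with the number of changed customers.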
>> 
>> ---
>> 
>> Which solution seems to be the right one? Is Cassandra really a good
>> fit for this use case?
>> 
>> Thanks
>> 
>> Alexis Coudeyras
>> 
>> --
>> View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
>> Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.
> 

