incubator-cassandra-user mailing list archives

From acoudeyras <acoudey...@gmail.com>
Subject Data Modeling
Date Mon, 20 Feb 2012 10:49:54 GMT
Hi,

I'm new to Cassandra and I'm looking for the best way to handle my use case.

My entities look like this:

customers : [{
	id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
	firstName: "Carl",
	lastName: "Smith",
	country:"FR"
},{
	id:21EC2020-3AEA-1069-A2DD-08002B30309D,
	firstName: "John",
	lastName: "Doe",
	country:"EN"
}]

I will use the term "field" to describe a property of a customer (lastName,
for example).

I will have 1 million customers and more than 300 fields (firstName,
lastName, ...) for each customer.

I have two requirements :

- I need to retrieve all values of a field (all firstNames, all lastNames,
...).
	- The faster the better (1 to 3 seconds).
	- It must preserve order: if I retrieve all countries and then all
lastNames, the nth country and the nth lastName should correspond to the same
customer.
	- Sometimes I will have to retrieve all values of multiple fields (< 10).

- Data will be updated (insert, delete, update) in bulk every 10 or 20
minutes; only a small number of entities will change each time. When an
update occurs, I have the whole entity as input (a full customer with all
of its fields). Performance is important, but less so than in the previous
case (10 seconds for an update is OK).

- Retrieving a customer by id or retrieving a list of customers matching
specific criteria is *not* a requirement.

---
Solution 1:

Column Family : customers
One row for each customer : 1 million rows
One column for each field : 300 fields by row.

Benefits : easy to update
Problem : As far as I understand, it doesn't seem to fit the Cassandra
model; getting all values of a field will be slow.
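
To make the trade-off in Solution 1 concrete, here is a minimal in-memory sketch in Python. Plain dicts stand in for the column family, so nothing here is Cassandra client code; the helper names (update_customer, all_values) are hypothetical, and sorting by id is only an assumed way to get an order that is stable across fields.

```python
# In-memory model of Solution 1: one row per customer, one column per field.
# Plain dicts stand in for a column family; no Cassandra client is involved.
customers = {
    "3F2504E0-4F89-11D3-9A0C-0305E82C3301": {
        "firstName": "Carl", "lastName": "Smith", "country": "FR"},
    "21EC2020-3AEA-1069-A2DD-08002B30309D": {
        "firstName": "John", "lastName": "Doe", "country": "EN"},
}


def update_customer(rows, customer_id, fields):
    """Overwrite one row with the full entity: a single-row write."""
    rows[customer_id] = dict(fields)


def all_values(rows, field):
    """Collect one field across all customers; this visits every row."""
    # Sorting by id (an assumption of this sketch) gives the same order
    # for every field, so the nth values belong to the same customer.
    return [rows[cid].get(field) for cid in sorted(rows)]


countries = all_values(customers, "country")
last_names = all_values(customers, "lastName")
```

The update is a single write per customer, but reading one field has to walk all rows, which is the cost the "Problem" above refers to.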

---
Solution 2:

Wide Row for the whole entity

Column Family : datas
One row : customers
Composite Columns : (fieldName, ID) = fieldValue

Customers : [{
	("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "FR",
	("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "EN",
	("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Carl",
	("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "John",
	("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") = "Smith",
	("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") = "Doe",
...
}]


As far as I understand, this seems to be the fastest way to retrieve all
values of a field in the same order.
To update, I don't need to read before writing.

Problem : the row will be very large : 300,000,000 columns (1 million
customers x 300 fields). I can split it into different rows based on the
value of a specific field, for example country.
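
The layout of Solution 2 can be sketched in memory as follows. This is only an illustration under assumptions: a dict keyed by (fieldName, id) tuples models the composite column names, Python's tuple ordering stands in for the composite comparator, and put_customer and slice_field are hypothetical helpers, not a client API.

```python
# In-memory model of Solution 2: one wide row whose column names are
# (fieldName, id) composites. A dict models the row; sorting its keys
# models the comparator ordering columns by fieldName, then by id.
columns = {}


def put_customer(columns, customer_id, fields):
    """Blind write of every field of one customer; no read is needed."""
    for name, value in fields.items():
        columns[(name, customer_id)] = value


def slice_field(columns, field):
    """Return all values of one field, ordered by customer id."""
    return [columns[key] for key in sorted(columns) if key[0] == field]


put_customer(columns, "3F2504E0-4F89-11D3-9A0C-0305E82C3301",
             {"firstName": "Carl", "lastName": "Smith", "country": "FR"})
put_customer(columns, "21EC2020-3AEA-1069-A2DD-08002B30309D",
             {"firstName": "John", "lastName": "Doe", "country": "EN"})

countries = slice_field(columns, "country")
last_names = slice_field(columns, "lastName")

# Size of the single row at the scale described in this message.
CUSTOMER_COUNT = 1_000_000
FIELD_COUNT = 300
total_columns = CUSTOMER_COUNT * FIELD_COUNT
```

Because every field's slice is ordered by the same id component, the nth country and the nth lastName refer to the same customer, as long as every customer has a value for both fields.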

---
Solution 3:

Wide Row by field 

Column Family : customers
One row by field : so 300 rows
Columns : ID = FieldValue

Benefits :
The row will be smaller, 1,000,000 columns.

Problem :
Updates seem more expensive: for every customer to update, I need to update
300 rows.
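
A matching in-memory sketch of Solution 3 shows where the update cost comes from. Again this is a model built on assumptions, not client code: a dict of dicts stands in for the column family, and update_customer and all_values are hypothetical helpers.

```python
from collections import defaultdict

# In-memory model of Solution 3: one row per field, one column per customer.
rows_by_field = defaultdict(dict)  # fieldName -> {customer id: value}


def update_customer(rows_by_field, customer_id, fields):
    """Write one column into each field row; returns the rows touched."""
    for name, value in fields.items():
        rows_by_field[name][customer_id] = value
    return len(fields)


def all_values(rows_by_field, field):
    """Read a single row and return its values ordered by customer id."""
    row = rows_by_field[field]
    return [row[cid] for cid in sorted(row)]


touched = update_customer(
    rows_by_field, "3F2504E0-4F89-11D3-9A0C-0305E82C3301",
    {"firstName": "Carl", "lastName": "Smith", "country": "FR"})
update_customer(
    rows_by_field, "21EC2020-3AEA-1069-A2DD-08002B30309D",
    {"firstName": "John", "lastName": "Doe", "country": "EN"})

countries = all_values(rows_by_field, "country")
last_names = all_values(rows_by_field, "lastName")
```

Reading one field touches a single row, while updating one customer touches as many rows as the customer has fields (3 in this sample, 300 at the scale described above).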

---

Which solution seems to be the right one? Is Cassandra really a good fit
for this use case?

Thanks

Alexis Coudeyras

--
View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
Sent from the cassandra-user@incubator.apache.org mailing list archive at Nabble.com.