Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id AF59B91FF for ; Tue, 21 Feb 2012 08:08:41 +0000 (UTC) Received: (qmail 89384 invoked by uid 500); 21 Feb 2012 08:08:39 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 89310 invoked by uid 500); 21 Feb 2012 08:08:38 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 89299 invoked by uid 99); 21 Feb 2012 08:08:38 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 21 Feb 2012 08:08:38 +0000 X-ASF-Spam-Status: No, hits=3.5 required=5.0 tests=FSL_RCVD_USER,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,SPF_PASS,URI_HEX X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [208.113.200.5] (HELO homiemail-a50.g.dreamhost.com) (208.113.200.5) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 21 Feb 2012 08:08:32 +0000 Received: from homiemail-a50.g.dreamhost.com (localhost [127.0.0.1]) by homiemail-a50.g.dreamhost.com (Postfix) with ESMTP id BDDFC6F8060 for ; Tue, 21 Feb 2012 00:08:06 -0800 (PST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=thelastpickle.com; h=from :mime-version:content-type:subject:date:in-reply-to:to :references:message-id; q=dns; s=thelastpickle.com; b=xm8L5MEhxL 5lQvECuKWBLON6iYIs7SChYNTMj5rWZ8rV4VEdZxP9SH0/7W6Ig9XA/LOyMOkM4G e+wcKtwUmR2nMXzP3TxozOqSu857oiJRFvum4qhfxkuM9BxDWKL2WTrztoJ/Q/SI muRRuOObQYb1c+DaLIWcWCK/JeKNiK18Q= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=thelastpickle.com; h=from :mime-version:content-type:subject:date:in-reply-to:to :references:message-id; s=thelastpickle.com; bh=CmWFAFtDVFrpMzYb ShMxprM1Q7o=; b=b+zMyPEon9UJCEgHUxPWGAHS2QXAFqCk26wfS5U9tmMx4YLo a4JeAcVfOPlBKq5WuxFuOd9BdtMbf0FxJ/8DRz6YLjYCrgaAlGFesL1ze8pB/Pc5 NqsUywbW5ZaLiiEa1CDG6sRKy67HmnCLfTwEjlPDuv58qixp+w0ihsDldMk= Received: from [172.16.1.3] (125-236-193-159.adsl.xtra.co.nz [125.236.193.159]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) (Authenticated sender: aaron@thelastpickle.com) by homiemail-a50.g.dreamhost.com (Postfix) with ESMTPSA id 001466F8058 for ; Tue, 21 Feb 2012 00:08:05 -0800 (PST) From: aaron morton Mime-Version: 1.0 (Apple Message framework v1257) Content-Type: multipart/alternative; boundary="Apple-Mail=_BE78C827-4AA3-434C-A42E-F4FD310E59E4" Subject: Re: Data Modeling Date: Tue, 21 Feb 2012 21:08:01 +1300 In-Reply-To: <0EA361E3-A99B-40FA-9938-D5EB5DA23EEF@gmail.com> To: user@cassandra.apache.org References: <1329734994547-7300846.post@n2.nabble.com> <3C0D4074-C4CD-46B6-912A-A19DAAF86040@thelastpickle.com> <0EA361E3-A99B-40FA-9938-D5EB5DA23EEF@gmail.com> Message-Id: X-Mailer: Apple Mail (2.1257) --Apple-Mail=_BE78C827-4AA3-434C-A42E-F4FD310E59E4 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 > For CF PropertyValues, instead of should = I do to preserve the same order for each = property_value ? (there will be custom null value). Whatever works best for you.=20 > Why is using only columns names faster ? It seems that it's not = possible to retrieve column names without column values in Hector for = example, so even after reading your article (great by the way), i don't = get it. Not sure what you mean.=20 Cheers ----------------- Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 21/02/2012, at 10:17 AM, alexis coudeyras wrote: > Thanks a lot Aaron, >=20 > I will try your idea tomorow. >=20 > For CF PropertyValues, instead of should = I do to preserve the same order for each = property_value ? (there will be custom null value). >=20 > Why is using only columns names faster ? It seems that it's not = possible to retrieve column names without column values in Hector for = example, so even after reading your article (great by the way), i don't = get it. >=20 >=20 > Le 20 f=E9vr. 2012 =E0 20:41, aaron morton a =E9crit : >=20 >> If you want to read all possible values for a field, where the field = has 1 million possible values it's going to take time. No matter what = data model you use.=20 >>=20 >> That said, the first model I would use is: >>=20 >> CF: Customer >> Use this as a canonical record of the properties a customer has.=20 >> row_key : >> cols: =3D >>=20 >> CF: PropertyValues >> Use this to perform to build the reverse index. Column names are a = composite value of property value and customer ID. >> row_key: >> cols: =3D EMPTY >>=20 >> * To Insert: It is good if you can work out the delta. Just update = what you need to in the customer, delete the old values from the = PropertyValues CF and insert the new ones. Note: I would insert when you = get the new data,=20 >>=20 >> * To Read: >>> - I need to retrieve all values of a field (all firstNames, all = lastNames, >> Get all the values from the appropriate row.=20 >>> - The fastest the better (1 to 3 seconds) >> Things take time = http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ >>> - It must preserve order : if i retrieve all countries and then = all >>> lastName, the nth country and the nth lastName should correspond to = the same >>> customer. >> Can only be guaranteed if every customer has a value for every field. = Or if you use a custom null value.=20 >>> - Sometimes I will have to retrieve all values of multiples = fields (< 10) >> There is no provision for server side joins. If you have a query you = use often it is best to materialise the result . >>=20 >> Hope that helps.=20 >>=20 >> ----------------- >> Aaron Morton >> Freelance Developer >> @aaronmorton >> http://www.thelastpickle.com >>=20 >> On 20/02/2012, at 11:49 PM, acoudeyras wrote: >>=20 >>> Hi, >>>=20 >>> I'm new to Cassandra and i'm looking for the best way to handle my = use case. >>>=20 >>> My entities look like : >>>=20 >>> customers : [{ >>> id: 3F2504E0-4F89-11D3-9A0C-0305E82C3301, >>> firstName: "Carl", >>> lastName: "Smith", >>> country:"FR" >>> },{ >>> id:21EC2020-3AEA-1069-A2DD-08002B30309D, >>> firstName: "John", >>> lastName: "Doe" >>> country:"EN" >>> }] >>>=20 >>> I will use the term "field" to describe a property of customer = (lastName for >>> example). >>>=20 >>> I will have 1 millions of customers and more than 300 fields = (firstName, >>> lastName, ...) for each customer. >>>=20 >>> I have two requirements : >>>=20 >>> - I need to retrieve all values of a field (all firstNames, all = lastNames, >>> ...). >>> - The fastest the better (1 to 3 seconds) >>> - It must preserve order : if i retrieve all countries and then = all >>> lastName, the nth country and the nth lastName should correspond to = the same >>> customer. >>> - Sometimes I will have to retrieve all values of multiples = fields (< 10) >>>=20 >>> - Datas will be updated (insert, delete, update), every 10 or 20 = minutes in >>> bulk, just a small number of entities will change each time. When an = update >>> occurs, in input I have the whole entity (a full customer with all = his >>> fields). Performance is important, but less than in the previous = case (10 >>> seconds for updating is ok). >>>=20 >>> - Retrieving a customer by id or retrieving a list of customer with = some >>> specific criteria is *not* a requirement. >>>=20 >>> --- >>> Solution 1: >>>=20 >>> Column Family : customers >>> One row for each customer : 1 million rows >>> One column for each field : 300 fields by row. >>>=20 >>> Benefits : easy to update >>> Problem : As far as i understand, it doesn't seems to fit with = cassandra >>> model, getting all values will be slow. >>>=20 >>> --- >>> Solution 2: >>>=20 >>> Wide Row for the whole entity >>>=20 >>> Column Family : datas >>> One row : customers >>> Composite Columns : (fieldName, ID) =3D fieldValue >>>=20 >>> Customers : [{ >>> ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D "FR", >>> ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D "EN", >>> ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D = "Carl", >>> ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D = "John", >>> ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D = "Smith", >>> ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D "Doe", >>> ... >>> }] >>>=20 >>>=20 >>> As far as i understand it seems to be the fastest way to retrieve = all values >>> of a field in the same order. >>> To update, i don't need to read before writing. >>>=20 >>> Problem : the row will be very large : 300 000 000 of columns. I can = split >>> it in different rows based on the value of the specific field, for = example >>> country. >>>=20 >>> --- >>> Solution 3: >>>=20 >>> Wide Row by field=20 >>>=20 >>> Column Family : customers >>> One row by field : so 300 rows >>> Columns : ID =3D FieldValue >>>=20 >>> Benefits : >>> The row will be smaller, 1 000 000 colums. >>>=20 >>> Problem : >>> Update seems more expensive, for every customer to update, i need to = update >>> 300 rows. >>>=20 >>> --- >>>=20 >>> Witch solution seems to be the good one ? Does Cassandra is really a = good >>> fit for this use case ? >>>=20 >>> Thanks >>>=20 >>> Alexis Coudeyras >>>=20 >>> -- >>> View this message in context: = http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Data-Mode= ling-tp7300846p7300846.html >>> Sent from the cassandra-user@incubator.apache.org mailing list = archive at Nabble.com. >>=20 >=20 --Apple-Mail=_BE78C827-4AA3-434C-A42E-F4FD310E59E4 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=iso-8859-1
Whatever works best for = you. 

Why is using only columns = names faster ? It seems that it's not possible to retrieve column names = without column values in Hector for example, so even after reading your = article (great by the way), i don't get = it.
Not sure what you = mean. 

Cheers

http://www.thelastpickle.com

On 21/02/2012, at 10:17 AM, alexis coudeyras = wrote:

Thanks a lot = Aaron,

I will try your idea = tomorow.

For CF PropertyValues, instead of = <property_value:customer_id> should I = do <customer_id:property_value> to preserve the same order = for each property_value ? (there will be custom null = value).

Why is using only columns names faster = ? It seems that it's not possible to retrieve column names without = column values in Hector for example, so even after reading your article = (great by the way), i don't get = it.


Le 20 f=E9vr. 2012 = =E0 20:41, aaron morton a =E9crit :

If you want to read all = possible values for a field, where the field has 1 million possible = values it's going to take time. No matter what data model you = use. 

That said, the first model I would = use is:

CF: Customer
Use this as a = canonical record of the properties a customer = has. 
row_key : <customer_id>
cols: = <property_name> =3D = <property_value>

CF: = PropertyValues
Use this to perform to build the reverse index. = Column names are a composite value of property value and customer = ID.
row_key: <property_name>
cols: = <property_value:customer_id> =3D EMPTY

* = To Insert: It is good if you can work out the delta. Just update what = you need to in the customer, delete the old values from the = PropertyValues CF and insert the new ones. Note: I would insert when you = get the new data, 

* To = Read:
  - I = need to retrieve all values of a field (all firstNames, all = lastNames,
Get all the values from the = appropriate row. 
- The = fastest the better (1 to 3 seconds)
Things take = time http:/= /thelastpickle.com/2011/07/04/Cassandra-Query-Plans/
- It must preserve order : if i = retrieve all countries and then all
lastName, the nth country and the = nth lastName should correspond to the = same
customer.
Can only be guaranteed if every = customer has a value for every field. Or if you use a custom null = value. 
- = Sometimes I will have to retrieve all values of multiples fields (< = 10)
There is no provision for server side joins. = If you have a query you use often it is best to materialise the result = .

Hope that = helps. 

http://www.thelastpickle.com

On 20/02/2012, at 11:49 PM, acoudeyras wrote:

Hi,

I'm new to Cassandra and i'm looking for = the best way to handle my use case.

My entities look like = :

customers : [{
id: = 3F2504E0-4F89-11D3-9A0C-0305E82C3301,
firstName: "Carl",
lastName: = "Smith",
= country:"FR"
},{
= id:21EC2020-3AEA-1069-A2DD-08002B30309D,
= firstName: "John",
lastName: "Doe"
= country:"EN"
}]

I will use the term "field" to describe = a property of customer (lastName for
example).

I will have 1 = millions of customers and more than 300 fields (firstName,
lastName, = ...) for each customer.

I have two requirements :

- I need = to retrieve all values of a field (all firstNames, all = lastNames,
...).
- The fastest the better (1 to 3 = seconds)
= - It must preserve order : if i retrieve all countries and then = all
lastName, the nth country and the nth lastName should correspond = to the same
customer.
- Sometimes I will have to = retrieve all values of multiples fields (< 10)

- Datas will be = updated (insert, delete, update), every 10 or 20 minutes in
bulk, = just a small number of entities will change each time. When an = update
occurs, in input I have the whole entity (a full customer with = all his
fields). Performance is important, but less than in the = previous case (10
seconds for updating is ok).

- Retrieving a = customer by id or retrieving a list of customer with some
specific = criteria is *not* a requirement.

---
Solution 1:

Column = Family : customers
One row for each customer : 1 million rows
One = column for each field : 300 fields by row.

Benefits : easy to = update
Problem : As far as i understand, it doesn't seems to fit with = cassandra
model, getting all values will be = slow.

---
Solution 2:

Wide Row for the whole = entity

Column Family : datas
One row : customers
Composite = Columns : (fieldName, ID) =3D fieldValue

Customers : [{
= ("country", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D = "FR",
= ("country", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D = "EN",
= ("firstName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D = "Carl",
= ("firstName", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D = "John",
= ("lastName", "3F2504E0-4F89-11D3-9A0C-0305E82C3301") =3D = "Smith",
= ("lastName", "21EC2020-3AEA-1069-A2DD-08002B30309D") =3D = "Doe",
...
}]


As far as i understand it seems to be the = fastest way to retrieve all values
of a field in the same = order.
To update, i don't need to read before writing.

Problem = : the row will be very large : 300 000 000 of columns. I can split
it = in different rows based on the value of the specific field, for = example
country.

---
Solution 3:

Wide Row by field =

Column Family : customers
One row by field : so 300 = rows
Columns : ID =3D FieldValue

Benefits :
The row will be = smaller, 1 000 000 colums.

Problem :
Update seems more = expensive, for every customer to update, i need to update
300 = rows.

---

Witch solution seems to be the good one ? Does = Cassandra is really a good
fit for this use case = ?

Thanks

Alexis Coudeyras

--
View this message = in context: http://cassandra-user-incubator-apach= e-org.3065146.n2.nabble.com/Data-Modeling-tp7300846p7300846.html
Se= nt from the cassandra-user@incubat= or.apache.org mailing list archive at Nabble.com.


<= /html>= --Apple-Mail=_BE78C827-4AA3-434C-A42E-F4FD310E59E4--