From: aaron morton
Subject: Re: sensible data model ?
Date: Tue, 7 Feb 2012 08:39:10 +1300
To: user@cassandra.apache.org
Message-Id: <68619175-4883-4E76-B5D4-B79A952000BC@thelastpickle.com>

Sounds like a good start. Super columns are not a great fit for modeling time series data for a few reasons; here is one: http://wiki.apache.org/cassandra/CassandraLimitations

It's also a good idea to partition time series data so that the rows do not grow too big. You can have 2 billion columns in a row, but big rows have operational downsides.

You could go with either:

rows: <entity_id:date>
column: <property_name>

Which would mean each time you query for a date range you need to query multiple rows. But it is possible to get a range of columns / properties.
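
As a rough sketch of this first layout (plain Python with no Cassandra client involved; the helper names row_key and keys_for_range, the ':' separator and the ISO date format are assumptions for illustration only), a date-range query expands into one row key per day:

```python
from datetime import date, timedelta

def row_key(entity_id, day):
    """Build an <entity_id:date> row key, e.g. 'e42:2012-02-06' (format is assumed)."""
    return "%s:%s" % (entity_id, day.isoformat())

def keys_for_range(entity_id, start, end):
    """Return one row key per day in the inclusive range [start, end]."""
    days = (end - start).days + 1
    return [row_key(entity_id, start + timedelta(days=i)) for i in range(days)]

# Three days means three rows to read for a single entity.
keys = keys_for_range("e42", date(2012, 2, 1), date(2012, 2, 3))
print(keys)  # ['e42:2012-02-01', 'e42:2012-02-02', 'e42:2012-02-03']
```

The number of rows read grows as entities x days, which is where the multiple-row cost comes from.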

Or

rows: <entity_id:time_partition>
column: <date:property_name>

Where time_partition is something that makes sense in your problem domain, e.g. a calendar month. If you often query for days in a month you can then get all the columns for the days you are interested in (using a column range). If you only want to get a subset of the entity properties you will need to get them all and filter them client side; depending on the number and size of the properties, this may be more efficient than multiple calls.
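
A minimal sketch of this second layout, again in plain Python with a dict standing in for a row. The helper names, the monthly partition, the ':' separator, and the assumption that column names sort as plain ASCII strings are all illustrative, not taken from any client library:

```python
from datetime import date

def row_key(entity_id, day):
    """<entity_id:time_partition>, assuming a calendar month as the partition."""
    return "%s:%s" % (entity_id, day.strftime("%Y-%m"))

def column_name(day, property_name):
    """<date:property_name>; ISO dates sort chronologically as strings."""
    return "%s:%s" % (day.isoformat(), property_name)

def day_range(start, end):
    """Slice bounds covering every property column from start to end inclusive."""
    # ';' follows ':' in ASCII, so 'end;' sorts after every 'end:<property>' name.
    return start.isoformat() + ":", end.isoformat() + ";"

def slice_row(row, start_col, finish_col):
    """Stand-in for a column range read: bounds are inclusive on both ends."""
    return {c: row[c] for c in sorted(row) if start_col <= c <= finish_col}

def filter_properties(columns, wanted):
    """Client-side filter: keep only the property names in `wanted`."""
    return {c: v for c, v in columns.items() if c.split(":", 1)[1] in wanted}

row = {
    column_name(date(2012, 2, 1), "key1"): 1,
    column_name(date(2012, 2, 2), "key1"): 2,
    column_name(date(2012, 2, 2), "key2"): 3,
    column_name(date(2012, 2, 3), "key1"): 4,
    column_name(date(2012, 2, 4), "key1"): 5,
}
start_col, finish_col = day_range(date(2012, 2, 2), date(2012, 2, 3))
cols = slice_row(row, start_col, finish_col)  # all properties for 2 Feb and 3 Feb
print(filter_properties(cols, {"key1"}))
```

With 100 properties per day, a monthly row holds at most 31 x 100 = 3,100 columns, against roughly 365,000 for a single unpartitioned row covering 10 years.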

One word of warning: avoid sending read requests for lots (i.e. 100s) of rows at once, as it will reduce overall query throughput. Some clients, like pycassa, take care of this for you.
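
If your client does not do this for you, one way to keep each request small is to split the key list into batches yourself. The sketch below is plain Python; fake_fetch is a hypothetical stand-in for whatever multi-row read your client offers, and the batch size of 50 is an arbitrary assumption rather than a recommendation:

```python
def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def multiget_in_batches(fetch, keys, batch_size=50):
    """Call fetch(batch) once per batch and merge the per-batch results."""
    results = {}
    for batch in chunked(keys, batch_size):
        results.update(fetch(batch))
    return results

# Hypothetical stand-in for a client call that reads several rows at once.
calls = []
def fake_fetch(batch):
    calls.append(len(batch))
    return {key: {} for key in batch}

rows = multiget_in_batches(fake_fetch, ["row%d" % i for i in range(120)], batch_size=50)
print(calls)  # [50, 50, 20]
```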

Good luck.

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 5/02/2012, at 12:12 AM, Franc Carter wrote:


> Hi,
>
> I'm pretty new to Cassandra and am currently doing a proof of concept, and thought it would be a good idea to ask if my data model is sane...
>
> The data I have, and need to query, is reasonably simple. It consists of about 10 million entities, each of which has a set of key/value properties for each day for about 10 years. The number of keys is in the 50-100 range and there will be a lot of overlap for keys in <entity,days>
>
> The queries I need to make are for sets of key/value properties for an entity on a day, e.g. key1, key2, key3 for 10 entities on 20 days. The number of entities and/or days in the query could be either very small or very large.
>
> I've modeled this with a simple column family for the keys, with the row key being the concatenation of the entity and date. My first go used only the entity as the row key and then used a supercolumn for each date. I decided against this mostly because it seemed more complex for a gain I didn't really understand.
>
> Does this seem sensible?
>
> thanks
>
> --
> Franc Carter | Systems architect | Sirca Ltd
> franc.carter@sirca.org.au | www.sirca.org.au
> Tel: +61 2 9236 9118
> Level 9, 80 Clarence St, Sydney NSW 2000
> PO Box H58, Australia Square, Sydney NSW 1215

