From: Nate Sammons
To: "user@cassandra.apache.org"
Date: Thu, 16 Feb 2012 11:23:25 -0800
Subject: Best way to store and index time series items with multiple other dimensions?

I'm trying to figure out the best way to store items for query based on multiple dimensions. I've got a large volume (many 100s of millions per day) of time-ordered objects with 10+ properties each that I need to support arbitrary query expressions on. So I may need to support a query based on a segment of time plus an expression like "A == 'foo' and B == 'bar' and C == 'baz'" etc. Any pointers?

For simple time-ordered retrieval I was going to have a set of time buckets used as row keys, something like YYYY-MM-DD-HH, with an extra character or two to reduce hotspots (probably take a hash of the object and use the first hex digit of the hash), so a row key might look like:

    2012-02-16-09:a

This way I'm spreading writes for that hour across 16 rows. Then the column name would be a TimeUUID or some other time-based value, and the column value would be the object. This lets me easily slice out segments of time, and lets me write data really well.

However, if I need to satisfy a query for items matching some expression during the day, I have to scan a *lot* of records. I can require some property to always be present in the query, and I can base the extra character in the row key on that property, so when I scan records I can cut down the number of row keys read by a factor of 16, but that's still a huge amount of data to just scan through.

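To make the bucketing scheme above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names (shard_for, row_key, row_keys_for_range) are hypothetical, MD5 is an arbitrary choice of hash, and no Cassandra client calls are shown. It builds a row key from an hour bucket plus one hex digit, and lists the row keys a time-range scan would have to touch, with and without a known sharding property.

```python
import hashlib
from datetime import datetime, timedelta

HEX_DIGITS = "0123456789abcdef"  # 16 possible shard characters


def shard_for(value: bytes) -> str:
    """Return the first hex digit of a hash of the value.

    MD5 is an assumption here; any stable hash would do.
    """
    return hashlib.md5(value).hexdigest()[0]


def row_key(ts: datetime, shard_source: bytes) -> str:
    """Build a key of the form YYYY-MM-DD-HH:<hex digit>."""
    return f"{ts:%Y-%m-%d-%H}:{shard_for(shard_source)}"


def row_keys_for_range(start: datetime, end: datetime, shard_source: bytes = None):
    """List the row keys covering [start, end).

    Without shard_source, all 16 shards of every hour are listed.
    With it (the property value the query is required to carry),
    only one shard per hour is listed.
    """
    if shard_source is None:
        shards = list(HEX_DIGITS)
    else:
        shards = [shard_for(shard_source)]
    hour = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while hour < end:
        keys.extend(f"{hour:%Y-%m-%d-%H}:{s}" for s in shards)
        hour += timedelta(hours=1)
    return keys
```

For a two-hour window this lists 32 row keys when nothing is known about the sharding property and 2 when it is, which is the factor-of-16 reduction described above. Every column in those rows still has to be read and filtered against the rest of the expression.
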
One obvious choice here is secondary indexes, but that implies "short" rows that can't be time sliced as easily, and I don't know that having a bunch of secondary indexes will scale very well (or support range queries).

Any ideas on a way to structure data for easy queries like this?

Thanks,

-nate

Nate Sammons | Sr. Technical Specialist | FTEN, A NASDAQ OMX Company
Office: +1.720.889.5141 | Email: nsammons@ften.com
Aggregation. Transparency. Control. (tm) | www.FTEN.com
