Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of NSammons@ften.com designates
 207.5.74.110 as permitted sender)
From: Nate Sammons <NSammons@ften.com>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thu, 16 Feb 2012 11:23:25 -0800
Subject: Best way to store and index time series items with multiple other
 dimensions?
Message-ID: 
 <95AD5EB0BCCF284CB0194E8300A23E4A4DECBAC848@EXVDMBX003-1.exch003intermedia.net>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: multipart/alternative;
	boundary="_000_95AD5EB0BCCF284CB0194E8300A23E4A4DECBAC848EXVDMBX0031ex_"
MIME-Version: 1.0

--_000_95AD5EB0BCCF284CB0194E8300A23E4A4DECBAC848EXVDMBX0031ex_
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable

I'm trying to figure out the best way to store items so they can be
queried on multiple dimensions.  I've got a large volume (many hundreds
of millions per day) of time-ordered objects with 10+ properties each,
and I need to support arbitrary query expressions on them.  So I may
need to support a query over a segment of time plus an expression like
"A == 'foo' and B == 'bar' and C == 'baz'", etc.  Any pointers?

For simple time-ordered retrieval I was going to use a set of time
buckets as row keys, something like YYYY-MM-DD-HH, with an extra
character or two appended to reduce hotspots (probably take a hash of
the object and use the first hex digit of the hash), so a row key might
look like:

   2012-02-16-09:a

This way I'm spreading writes for that hour across 16 rows.  The column
name would then be a TimeUUID or some other time-based value, and the
column value would be the object.  This lets me easily slice out
segments of time, and it keeps writes cheap and evenly spread.
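
To make the key layout concrete, here is a minimal Python sketch of how
such a row key and column name could be built.  The function name
row_key, the use of MD5 as the hash, and the ":" separator handling are
assumptions for illustration only; any stable hash would do, and nothing
here is tied to a particular Cassandra client.

```python
import hashlib
import uuid
from datetime import datetime


def row_key(ts: datetime, payload: bytes) -> str:
    """Build an hourly bucket key with a one-character shard suffix.

    The shard is the first hex digit of a hash of the serialized object,
    so each hour is spread over 16 rows. MD5 is an arbitrary choice here.
    """
    bucket = ts.strftime("%Y-%m-%d-%H")
    shard = hashlib.md5(payload).hexdigest()[0]
    return f"{bucket}:{shard}"


def new_column_name() -> uuid.UUID:
    """Time-based (version 1) UUID, usable as a TimeUUID column name."""
    return uuid.uuid1()


key = row_key(datetime(2012, 2, 16, 9, 30), b"serialized object bytes")
column = new_column_name()
```

The key always has the form YYYY-MM-DD-HH:x, where x is one of the 16
hex digits, which matches the example row key above.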

However, if I need to satisfy a query for items matching some expression
over a whole day, I have to scan a *lot* of records.  I can require some
property to always be present in the query and derive the extra
character in the row key from that property instead, so that a scan
reads 16 times fewer row keys, but that's still a huge amount of data
to just scan through.
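
As a rough illustration of the scan cost, the Python sketch below lists
the row keys a query would have to read for a time range.  It assumes
the shard character comes from a hash of the value of the one property
every query must include; shard_for, row_keys_for_range and the MD5
hash are hypothetical names and choices, not an existing API.

```python
import hashlib
from datetime import datetime, timedelta

HEX_DIGITS = "0123456789abcdef"


def shard_for(value: str) -> str:
    """First hex digit of a hash of the required property's value."""
    return hashlib.md5(value.encode("utf-8")).hexdigest()[0]


def row_keys_for_range(start: datetime, end: datetime, required_value=None):
    """Row keys covering [start, end), one hourly bucket at a time.

    Without required_value all 16 shards of every hour must be read;
    with it, only the single shard that value hashes to.
    """
    if required_value is None:
        shards = list(HEX_DIGITS)
    else:
        shards = [shard_for(required_value)]
    hour = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while hour < end:
        bucket = hour.strftime("%Y-%m-%d-%H")
        keys.extend(f"{bucket}:{s}" for s in shards)
        hour += timedelta(hours=1)
    return keys


day_start = datetime(2012, 2, 16)
day_end = day_start + timedelta(days=1)
all_keys = row_keys_for_range(day_start, day_end)            # 24 * 16 keys
some_keys = row_keys_for_range(day_start, day_end, "foo")    # 24 keys
```

Even in the narrowed case, each of those 24 rows still holds roughly
1/16 of an hour's objects, all of which have to be read and filtered
against the rest of the expression.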

One obvious choice here is secondary indexes, but that implies "short"
rows that can't be time-sliced as easily, and I don't know that having a
bunch of secondary indexes will scale very well (or support range
queries).


Any ideas on a way to structure data for easy queries like this?


Thanks,

-nate


Nate Sammons | Sr. Technical Specialist | FTEN, A NASDAQ OMX Company
Office: +1.720.889.5141 | Email: nsammons@ften.com
Aggregation.  Transparency.  Control. (tm)  | www.FTEN.com <http://ften.com/>



--_000_95AD5EB0BCCF284CB0194E8300A23E4A4DECBAC848EXVDMBX0031ex_--