From: Miguel Verde <miguelitovert@gmail.com>
To: user@cassandra.apache.org
Subject: Re: Best way to store millisecond-accurate data
Date: Tue, 4 May 2010 09:35:55 -0500

One would use batch processes (e.g. through Hadoop) or client-side aggregation, yes. In theory it would be possible to introduce runtime sharding across rows into the Cassandra server side, but it's not part of its design.

In practice, one would want to model the data so that the "row has too many columns" scenario is prevented in the first place.

On May 4, 2010, at 8:06 AM, Daniel Simeonov <dsimeonov@gmail.com> wrote:

> Hi Miguel,
> I'd like to ask: is it possible to have runtime sharding of rows in Cassandra, i.e.
> if the row has too many new columns inserted, then create another row
> (let's say if the original time-sharding is one day per row, then we
> would have two rows for that day). Maybe batch processes could do that.
> Best regards, Daniel.
>
> 2010/4/24 Miguel Verde <miguelitovert@gmail.com>:
> TimeUUID's time component is measured in 100-nanosecond intervals. The
> library you use might calculate it with poorer accuracy or precision,
> but from a storage/comparison standpoint in Cassandra, millisecond
> data is easily captured by it.
>
> One typical way of dealing with the data explosion of sampled time
> series data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure)
> so that you put an upper bound on the row length.
>
> On Apr 23, 2010, at 7:01 PM, Andrew Nguyen <andrew-lists-cassandra@ucsfcti.org> wrote:
>
> Hello,
>
> I am looking to store patient physiologic data in Cassandra - it's
> being collected at rates of 1 to 125 Hz. I'm thinking of storing the
> timestamps as the column names and the patient/parameter combo as the
> row key. For example, Bob is in the ICU and is currently having his
> blood pressure, intracranial pressure, and heart rate monitored. I'd
> like to collect this with the following row keys:
>
> Bob-bloodpressure
> Bob-intracranialpressure
> Bob-heartrate
>
> The column names would be timestamps, but that's where my questions
> start:
>
> I'm not sure what the best data type and CompareWith would be. From my
> searching, it sounds like TimeUUID may be suitable but isn't really
> designed for millisecond accuracy. My other thought is just to store
> them as strings (2010-04-23 10:23:45.016). While space isn't the
> foremost concern, we will be collecting this data 24/7, so we'll be
> creating many columns over the long term.
>
> I found https://issues.apache.org/jira/browse/CASSANDRA-16, which
> states that the entire row must fit in memory. Does this include the
> values as well as the column names?
>
> In considering the limits of Cassandra and the best way to model this,
> we would be adding 3.9 billion columns per year (assuming 125 Hz @
> 24/7). However, I can't really think of a better way to model this...
> So, am I thinking about this all wrong, or am I on the right track?
>
> Thanks,
> Andrew
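Daniel's runtime-sharding idea can be approximated on the client side. The sketch below is a hypothetical helper, not part of Cassandra or any client library: it only does the row-key bookkeeping, and the actual column insert (via Thrift or a client library) is left out.

```python
class ShardedRowWriter:
    """Client-side stand-in for 'runtime sharding': once the current row
    has max_columns columns, roll over to a fresh row key by bumping a
    numeric shard suffix, so no single row grows without bound.
    """

    def __init__(self, base_key, max_columns):
        self.base_key = base_key
        self.max_columns = max_columns
        self.shard = 0   # current shard suffix
        self.count = 0   # columns written to the current shard

    def next_key(self):
        """Return the row key the next column should be written to."""
        if self.count >= self.max_columns:
            self.shard += 1
            self.count = 0
        self.count += 1
        return "%s-%d" % (self.base_key, self.shard)

w = ShardedRowWriter("Bob-20100504-bloodpressure", max_columns=2)
keys = [w.next_key() for _ in range(5)]
# keys ends in suffixes: -0, -0, -1, -1, -2
```

Note the read side then has to discover how many shards exist for a period, e.g. by scanning suffixes 0, 1, 2, ... until a row is missing, or by recording the highest shard in a small index row.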
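For the curious, the 100-nanosecond layout is easy to see by constructing a version-1 UUID by hand from a millisecond Unix timestamp. This is an illustrative sketch in plain Python, not necessarily how a given client library computes it; the function name and the zeroed clock_seq/node defaults are made up here.

```python
import uuid

# Offset between the UUID (Gregorian) epoch, 1582-10-15, and the Unix
# epoch, 1970-01-01, expressed in 100-nanosecond ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_from_millis(unix_millis, clock_seq=0, node=0):
    """Build a version-1 (time-based) UUID from a Unix timestamp in ms.

    clock_seq and node are placeholders; a real client library would
    fill them from a stable counter and the host's MAC address.
    """
    ticks = unix_millis * 10000 + UUID_EPOCH_OFFSET  # ms -> 100 ns ticks
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000   # stamp version 1
    clock_seq_hi = 0x80 | ((clock_seq >> 8) & 0x3F)       # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi, clock_seq & 0xFF, node))

u = timeuuid_from_millis(1272983755016)  # 2010-05-04 14:35:55.016 UTC
assert u.version == 1
# The original millisecond value round-trips exactly through u.time:
assert (u.time - UUID_EPOCH_OFFSET) // 10000 == 1272983755016
```

Since the tick is the high-order part of the sort key, columns under a TimeUUID comparator come back in time order, which is the property that matters here.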
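That bucketing scheme amounts to deriving the row key from each sample's timestamp. A minimal sketch, where the helper name and the daily granularity are just for illustration:

```python
from datetime import datetime, timezone

def bucketed_row_key(patient, metric, unix_millis):
    """One row per patient/metric/UTC day, e.g. 'Bob-20100504-bloodpressure'.

    At 125 Hz a daily bucket holds at most 125 * 86400 = 10.8M columns;
    an hourly bucket (append %H to the format) would cap rows at 450,000.
    """
    day = datetime.fromtimestamp(unix_millis / 1000.0, tz=timezone.utc)
    return "%s-%s-%s" % (patient, day.strftime("%Y%m%d"), metric)

bucketed_row_key("Bob", "bloodpressure", 1272983755016)
# -> 'Bob-20100504-bloodpressure'
```

A reader asking for "Bob's blood pressure on May 4" can then compute the exact row key without any lookup, which is what makes time-bucketed keys convenient.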