Date: Wed, 5 May 2010 13:05:39 +0300
Subject: Re: Best way to store millisecond-accurate data
From: Даниел Симеонов <dsimeonov@gmail.com>
To: user@cassandra.apache.org

Hi,

"In practice, one would want to model their data such that the 'row has too many columns' scenario is prevented."

I am curious how to actually prevent this. If the data is sharded with one-day granularity, nothing stops a client from inserting an enormous number of new columns (very often it is impossible to foresee how much data clients will insert), so some mechanism is needed to keep a row from accumulating too many columns (where "too many" depends on the data), and such runtime sharding becomes necessary (splitting the one-day granularity into two rows). I am still wondering whether this kind of runtime sharding is possible in Cassandra.

Best regards, Daniel.
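P.S. Below is a minimal sketch of the kind of client-side "runtime sharding" I have in mind, in plain Java. It does not touch any Cassandra client API; the class name, the 1,000,000-column limit, and the counter bookkeeping are all made up for illustration, and a real system would need a sturdier way to count writes (or a deterministic sub-day split agreed on by all writers).

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical client-side helper: once a day bucket has received more than
// MAX_COLUMNS_PER_ROW columns, later writes spill into overflow rows
// ("<base>-1", "<base>-2", ...). Readers then scan base, base-1, base-2, ...
// until a key turns up empty.
public class DayBucketSharder {

    private static final long MAX_COLUMNS_PER_ROW = 1_000_000; // assumed limit

    // Per-process approximation of how many columns each day bucket has seen.
    private final Map<String, AtomicLong> written = new ConcurrentHashMap<>();

    /** Returns the row key the next column for this series/day should go to. */
    public String rowKeyFor(String seriesKey, String day) {
        String base = seriesKey + "-" + day;        // e.g. "Bob-bloodpressure-20100423"
        long n = written
                .computeIfAbsent(base, k -> new AtomicLong())
                .incrementAndGet();
        long shard = (n - 1) / MAX_COLUMNS_PER_ROW; // 0 for the first million columns, then 1, 2, ...
        return shard == 0 ? base : base + "-" + shard;
    }
}

The caller would use the returned key as the row key for its insert, and readers would have to probe the overflow keys in order when fetching a day's data.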
2010/5/4 Miguel Verde <miguelitovert@gmail.com>

> One would use batch processes (e.g. through Hadoop) or client-side
> aggregation, yes. In theory it would be possible to introduce runtime
> sharding across rows into the Cassandra server side, but it's not part of
> its design.
>
> In practice, one would want to model their data such that the 'row has too
> many columns' scenario is prevented.
>
> On May 4, 2010, at 8:06 AM, Даниел Симеонов <dsimeonov@gmail.com> wrote:
>
> Hi Miguel,
> I'd like to ask whether it is possible to have runtime sharding of rows in
> Cassandra, i.e. if a row has too many new columns inserted, then create
> another row (let's say if the original time sharding is one day per row,
> then we would have two rows for that day). Maybe batch processes could do
> that.
> Best regards, Daniel.
>
> 2010/4/24 Miguel Verde <miguelitovert@gmail.com>
>
>> TimeUUID's time component is measured in 100-nanosecond intervals. The
>> library you use might calculate it with poorer accuracy or precision, but
>> from a storage/comparison standpoint in Cassandra millisecond data is
>> easily captured by it.
>>
>> One typical way of dealing with the data explosion of sampled time series
>> data is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you
>> put an upper bound on the row length.
>>
>> On Apr 23, 2010, at 7:01 PM, Andrew Nguyen
>> <andrew-lists-cassandra@ucsfcti.org> wrote:
>>
>>> Hello,
>>>
>>> I am looking to store patient physiologic data in Cassandra - it's being
>>> collected at rates of 1 to 125 Hz. I'm thinking of storing the timestamps
>>> as the column names and the patient/parameter combo as the row key. For
>>> example, Bob is in the ICU and is currently having his blood pressure,
>>> intracranial pressure, and heart rate monitored. I'd like to collect this
>>> with the following row keys:
>>>
>>> Bob-bloodpressure
>>> Bob-intracranialpressure
>>> Bob-heartrate
>>>
>>> The column names would be timestamps but that's where my questions start:
>>>
>>> I'm not sure what the best data type and CompareWith would be. From my
>>> searching, it sounds like TimeUUID may be suitable but isn't really
>>> designed for millisecond accuracy. My other thought is just to store them
>>> as strings (2010-04-23 10:23:45.016). While space isn't the foremost
>>> concern, we will be collecting this data 24/7 so we'll be creating many
>>> columns over the long term.
>>>
>>> I found https://issues.apache.org/jira/browse/CASSANDRA-16 which states
>>> that the entire row must fit in memory. Does this include the values as
>>> well as the column names?
>>>
>>> In considering the limits of Cassandra and the best way to model this, we
>>> would be adding 3.9 billion rows per year (assuming 125 Hz @ 24/7).
>>> However, I can't really think of a better way to model this... So, am I
>>> thinking about this all wrong or am I on the right track?
>>>
>>> Thanks,
>>> Andrew
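(For reference, putting numbers on Andrew's rate: at 125 Hz a single un-split day row for one parameter already holds 125 × 86,400 = 10,800,000 columns, and over a year that is 125 × 86,400 × 365 ≈ 3.94 billion, which matches his 3.9 billion figure, although those would be columns under one row key rather than separate rows. That is why sub-day or overflow buckets seem hard to avoid.)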
=C2=A0=C2=A0 =C2=A0"In practice, one would want to model their = data such that the 'row has too much columns' scenario is prevented= ."
=C2=A0=C2=A0 I am curious how really to prevent this, if = the data is sharded with one day granularity, nothing stops the client to i= nsert enormous amount of new columns (very often it is not possible to fore= seen how much data clients would insert) then some functionality is needed = prevent too much columns in a row (too much depends on the data), then such= runtime sharding in necessary (to split the day granulary to two rows). I = still think if this runtime sharding is possible in cassandra.
Best regards, Daniel.

2010/5/4 Miguel Verde <= miguelitovert@= gmail.com>
One would use batch processes (e.g. through H= adoop) or client-side aggregation, yes. In theory it would be possible to i= ntroduce runtime sharding across rows into the Cassandra server side, but i= t's not part of its design.

In practice, one would want to model their data such th= at the 'row has too much columns' scenario is prevented.

On May 4, 2010, at 8:06 AM, =D0=94=D0=B0=D0= =BD=D0=B8=D0=B5=D0=BB =D0=A1=D0=B8=D0=BC=D0=B5=D0=BE=D0=BD=D0=BE=D0=B2 <= dsimeonov@gmail.co= m> wrote:

Hi Miguel,
=C2=A0= =C2=A0I'd like to ask is it possible to have runtime sharding or rows i= n cassandra, i.e. if the row has too much new columns inserted then create = another one row (let's say if the original timesharding is one day per = row, then we would have two rows for that day). Maybe batch processes could= do that.=C2=A0
Best regards, Daniel.

2010/4/24 Migu= el Verde <miguelitovert@gmail.com>
TimeUUID's time component is measured in 100-nanosecond intervals. The = library you use might calculate it with poorer accuracy or precision, but f= rom a storage/comparison standpoint in Cassandra millisecond data is easily= captured by it.

One typical way of dealing with the data explosion of sampled time series d= ata is to bucket/shard rows (i.e. Bob-20100423-bloodpressure) so that you p= ut an upper bound on the row length.


On Apr 23, 2010, at 7:01 PM, Andrew Nguyen <andrew-lists-cassandra@ucsfcti= .org> wrote:

Hello,

I am looking to store patient physiologic data in Cassandra - it's bein= g collected at rates of 1 to 125 Hz. =C2=A0I'm thinking of storing the = timestamps as the column names and the patient/parameter combo as the row k= ey. =C2=A0For example, Bob is in the ICU and is currently having his blood = pressure, intracranial pressure, and heart rate monitored. =C2=A0I'd li= ke to collect this with the following row keys:

Bob-bloodpressure
Bob-intracranialpressure
Bob-heartrate

The column names would be timestamps but that's where my questions star= t:

I'm not sure what the best data type and CompareWith would be. =C2=A0Fr= om my searching, it sounds like the TimeUUID may be suitable but isn't = really designed for millisecond accuracy. =C2=A0My other thought is just to= store them as strings (2010-04-23 10:23:45.016). =C2=A0While I space isn&#= 39;t the foremost concern, we will be collecting this data 24/7 so we'l= l be creating many columns over the long-term.

I found https://issues.apache.org/jira/browse/CASSANDRA-16<= /a> which states that the entire row must fit in memory. =C2=A0Does this in= clude the values as well as the column names?

In considering the limits of cassandra and the best way to model this, we w= ould be adding 3.9 billion rows per year (assuming 125 Hz @ 24/7). =C2=A0Ho= wever, I can't really think of a better way to model this... =C2=A0So, = am I thinking about this all wrong or am I on the right track?

Thanks,
Andrew


--001636eee1f9d3d46a0485d5f75d--