From: Miguel Verde <miguelitovert@gmail.com>
To: user@cassandra.apache.org
Subject: Re: Best way to store millisecond-accurate data
Date: Tue, 4 May 2010 09:35:55 -0500

One would use batch processes (e.g. through Hadoop) or client-side aggregation, yes. In theory it would be possible to introduce runtime sharding across rows into the Cassandra server side, but it's not part of its design.

In practice, one would want to model the data so that the "row has too many columns" scenario is prevented in the first place.

On May 4, 2010, at 8:06 AM, Daniel Simeonov <dsimeonov@gmail.com> wrote:

> Hi Miguel,
> I'd like to ask: is it possible to have runtime sharding of rows in Cassandra, i.e.
> if the row has too many new columns inserted, then create another row
> (let's say if the original time-sharding is one day per row, then we
> would have two rows for that day). Maybe batch processes could do that.
> Best regards, Daniel.
>
> 2010/4/24 Miguel Verde <miguelitovert@gmail.com>:
> TimeUUID's time component is measured in 100-nanosecond intervals. The
> library you use might calculate it with poorer accuracy or precision,
> but from a storage/comparison standpoint in Cassandra, millisecond
> data is easily captured by it.
>
> One typical way of dealing with the data explosion of sampled time
> series data is to bucket/shard rows (e.g. Bob-20100423-bloodpressure)
> so that you put an upper bound on the row length.
>
> On Apr 23, 2010, at 7:01 PM, Andrew Nguyen <andrew-lists-cassandra@ucsfcti.org> wrote:
>
> Hello,
>
> I am looking to store patient physiologic data in Cassandra - it's
> being collected at rates of 1 to 125 Hz. I'm thinking of storing the
> timestamps as the column names and the patient/parameter combo as the
> row key. For example, Bob is in the ICU and is currently having his
> blood pressure, intracranial pressure, and heart rate monitored. I'd
> like to collect this with the following row keys:
>
> Bob-bloodpressure
> Bob-intracranialpressure
> Bob-heartrate
>
> The column names would be timestamps, but that's where my questions
> start:
>
> I'm not sure what the best data type and CompareWith would be. From my
> searching, it sounds like TimeUUID may be suitable but isn't really
> designed for millisecond accuracy. My other thought is just to store
> them as strings (2010-04-23 10:23:45.016). While space isn't the
> foremost concern, we will be collecting this data 24/7, so we'll be
> creating many columns over the long term.
>
> I found https://issues.apache.org/jira/browse/CASSANDRA-16, which
> states that the entire row must fit in memory. Does this include the
> values as well as the column names?
>
> In considering the limits of Cassandra and the best way to model this,
> we would be adding 3.9 billion columns per year (assuming 125 Hz @
> 24/7). However, I can't really think of a better way to model this...
> So, am I thinking about this all wrong, or am I on the right track?
>
> Thanks,
> Andrew
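Daniel's runtime-sharding idea can be approximated on the client side. The sketch below is a hypothetical helper, not part of Cassandra or any client library: it only does the row-key bookkeeping, and the actual column insert (via Thrift or a client library) is left out.

```python
class ShardedRowWriter:
    """Client-side stand-in for 'runtime sharding': once the current row
    has max_columns columns, roll over to a fresh row key by bumping a
    numeric shard suffix, so no single row grows without bound.
    """

    def __init__(self, base_key, max_columns):
        self.base_key = base_key
        self.max_columns = max_columns
        self.shard = 0   # current shard suffix
        self.count = 0   # columns written to the current shard

    def next_key(self):
        """Return the row key the next column should be written to."""
        if self.count >= self.max_columns:
            self.shard += 1
            self.count = 0
        self.count += 1
        return "%s-%d" % (self.base_key, self.shard)

w = ShardedRowWriter("Bob-20100504-bloodpressure", max_columns=2)
keys = [w.next_key() for _ in range(5)]
# keys ends in suffixes: -0, -0, -1, -1, -2
```

Note the read side then has to discover how many shards exist for a period, e.g. by scanning suffixes 0, 1, 2, ... until a row is missing, or by recording the highest shard in a small index row.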
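For the curious, the 100-nanosecond layout is easy to see by constructing a version-1 UUID by hand from a millisecond Unix timestamp. This is an illustrative sketch in plain Python, not necessarily how a given client library computes it; the function name and the zeroed clock_seq/node defaults are made up here.

```python
import uuid

# Offset between the UUID (Gregorian) epoch, 1582-10-15, and the Unix
# epoch, 1970-01-01, expressed in 100-nanosecond ticks.
UUID_EPOCH_OFFSET = 0x01B21DD213814000

def timeuuid_from_millis(unix_millis, clock_seq=0, node=0):
    """Build a version-1 (time-based) UUID from a Unix timestamp in ms.

    clock_seq and node are placeholders; a real client library would
    fill them from a stable counter and the host's MAC address.
    """
    ticks = unix_millis * 10000 + UUID_EPOCH_OFFSET  # ms -> 100 ns ticks
    time_low = ticks & 0xFFFFFFFF
    time_mid = (ticks >> 32) & 0xFFFF
    time_hi_version = ((ticks >> 48) & 0x0FFF) | 0x1000   # stamp version 1
    clock_seq_hi = 0x80 | ((clock_seq >> 8) & 0x3F)       # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi, clock_seq & 0xFF, node))

u = timeuuid_from_millis(1272983755016)  # 2010-05-04 14:35:55.016 UTC
assert u.version == 1
# The original millisecond value round-trips exactly through u.time:
assert (u.time - UUID_EPOCH_OFFSET) // 10000 == 1272983755016
```

Since the tick is the high-order part of the sort key, columns under a TimeUUID comparator come back in time order, which is the property that matters here.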
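That bucketing scheme amounts to deriving the row key from each sample's timestamp. A minimal sketch, where the helper name and the daily granularity are just for illustration:

```python
from datetime import datetime, timezone

def bucketed_row_key(patient, metric, unix_millis):
    """One row per patient/metric/UTC day, e.g. 'Bob-20100504-bloodpressure'.

    At 125 Hz a daily bucket holds at most 125 * 86400 = 10.8M columns;
    an hourly bucket (append %H to the format) would cap rows at 450,000.
    """
    day = datetime.fromtimestamp(unix_millis / 1000.0, tz=timezone.utc)
    return "%s-%s-%s" % (patient, day.strftime("%Y%m%d"), metric)

bucketed_row_key("Bob", "bloodpressure", 1272983755016)
# -> 'Bob-20100504-bloodpressure'
```

A reader asking for "Bob's blood pressure on May 4" can then compute the exact row key without any lookup, which is what makes time-bucketed keys convenient.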