Return-Path: Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: (qmail 56868 invoked from network); 21 Apr 2010 18:01:02 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 21 Apr 2010 18:01:02 -0000 Received: (qmail 7733 invoked by uid 500); 21 Apr 2010 18:01:01 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 7711 invoked by uid 500); 21 Apr 2010 18:01:00 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 7703 invoked by uid 99); 21 Apr 2010 18:01:00 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Apr 2010 18:01:00 +0000 X-ASF-Spam-Status: No, hits=2.2 required=10.0 tests=FREEMAIL_FROM,HTML_MESSAGE,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of miguelitovert@gmail.com designates 209.85.160.172 as permitted sender) Received: from [209.85.160.172] (HELO mail-gy0-f172.google.com) (209.85.160.172) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 21 Apr 2010 18:00:51 +0000 Received: by gyh4 with SMTP id 4so3957252gyh.31 for ; Wed, 21 Apr 2010 11:00:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:received:message-id:subject:from:to:content-type; bh=2ycNN6PcuhZKCBsaz5DkwyZarEFrdU8AIMrvhND4fLo=; b=xb/W9dkLUYeiSCcM5YwCgU123ZsAN5MCDYgyQ0rEjjQKz7leJc3ek6lnGVubnz7LbT tvkEBJLC8ix083E27Do2AEBRrfhqmKtOgZHeShk5UoH8Kt83EdRM4sGKvoF52sT2d74R CG4+Gl4CY7BJKDHKdyy+AxPQa/ZSYWwjvTh7I= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=me7aDNnK/L9VcJsPhRxb7N/vMbbPVYix2vlDEnx6wMHZDYEEl+5eAJfPB2voUvivEg I41QAYOY2hx+NMuTIr32eDc+f7PrffG9HfGoHJ6jJxm+ijxCnjnqoc1se2qChiJpo7uj /w4SbDcPn6lgcy59TsrZowjUyjSESIJCkkTJY= MIME-Version: 1.0 Received: by 10.150.138.17 with HTTP; Wed, 21 Apr 2010 11:00:30 -0700 (PDT) In-Reply-To: References: Date: Wed, 21 Apr 2010 13:00:30 -0500 Received: by 10.150.160.18 with SMTP id i18mr8431556ybe.100.1271872830368; Wed, 21 Apr 2010 11:00:30 -0700 (PDT) Message-ID: Subject: Re: Cassandra data model for financial data From: Miguel Verde To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=000e0cd7615a40ec7a0484c2f8fd X-Virus-Checked: Checked by ClamAV on apache.org --000e0cd7615a40ec7a0484c2f8fd Content-Type: text/plain; charset=ISO-8859-1 On Wed, Apr 21, 2010 at 12:17 PM, Steve Lihn wrote: > [...] > Design 1: Each attribute is a super column. Therefore each date is a > column. So we have: > > AAPL -> closingPrice -> { '2010-04-13' : 242, '2010-04-14': 245 } > AAPL -> volume -> { '2010-04-13' : 10.9m, '2010-04-14': 14.4m } > etc. > I would suggest not using this design, as each query involving an attribute will pull all dates for that attribute into memory on the server. i.e. getting the closingPrice for AAPL on '2010-04-13' would pull all closing prices for AAPL across all dates into memory. > > Design 2: Each date is a super column. Therefore each attribute is a > column. So we have: > > AAPL -> '2010-04-13' -> { closingPrice -> 242, volume -> 10.9m } > AAPL -> '2010-04-14' -> {closingPrice -> 245, volume -> 14.4m } > etc. > > The date column / superColumn will need Order Perserving Partitioner since > we are going to do a lot of range queries. Partitioners split up keys between nodes, the partitioner you use has no effect on your ability to query columns in a row. > Examples are: > Query 1: Give me the data between date1 and date2 for a set of tickers > (say, the 100 tickers in QQQ). > You could use http://wiki.apache.org/cassandra/API#multiget_slice for this. > Query 2: More often than not, the query is: Give me the data for the max > available dates (for each ticker) between date1 and date2 in a set of > tickers. > (Since not every day is traded, and we only want the most recent data, > given a range of dates.) > A http://wiki.apache.org/cassandra/API#SliceRange allows you to specify limits and ordering for columns you are slicing. --000e0cd7615a40ec7a0484c2f8fd Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable
On Wed, Apr 21, 2010 at 12:17 PM, Steve Lihn <stevelihn@gmail.c= om> wrote:
[...]
=A0
Design 1: Each attribute is a su= per column. Therefore each date is a column. So we have:

AAPL -> = closingPrice -> { '2010-04-13' : 242, '2010-04-14': 245 = }
AAPL -> volume -> { '2010-04-13' : 10.9m, '2010-04-14'= ;: 14.4m }
etc.
I would suggest not using this design, as each query involving an attr= ibute will pull all dates=A0for that attribute into memory on the server.= =A0 i.e. getting the closingPrice for AAPL on '2010-04-13' would pu= ll all closing prices for AAPL across all dates into memory.
=A0

Design 2: Each date is a sup= er column. Therefore each attribute is a column. So we have:

AAPL -&= gt; '2010-04-13' -> { closingPrice -> 242, volume -> 10.9m= }
AAPL -> '2010-04-14' -> {closingPrice -> 245, volume ->= 14.4m }
etc.

The date column / superColumn will need Order Perse= rving Partitioner since we are going to do a lot of range queries.
=A0
Partitioners split up keys between nodes, the partitioner you use has = no effect on your ability to query columns in a row.
=A0
Examples are:
Query 1: Give m= e the data between date1 and date2 for a set of tickers (say, the 100 ticke= rs in QQQ).
=A0
Query 2: More often than not, th= e query is: Give me the data for the max available dates (for each ticker) = between date1 and date2 in a set of tickers.
(Since not every day is traded, and we only want the most recent data, give= n a range of dates.)
A http://w= iki.apache.org/cassandra/API#SliceRange=A0allows you to specify limits = and ordering for columns you are slicing.



=A0

--000e0cd7615a40ec7a0484c2f8fd--