Date: Tue, 11 Dec 2012 09:45:38 -0800
Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data
From: Andrey Ilinykh
To: user@cassandra.apache.org

I would consider using wide rows. If you add the timestamp to your column
name, you get naturally sorted data. You can easily select any time range
without any indexes.

Thank you,
  Andrey

On Tue, Dec 11, 2012 at 6:23 AM, Chin Ko wrote:

> I would like to get some opinions on how to select an incremental range of
> rows efficiently from a Cassandra CF containing time series data.
>
> Background:
> We have a web application that uses a Cassandra CF as logging storage. We
> insert a row into the CF for every "event" of each user of the web
> application. The row key is timestamp+userid. The column values are
> unstructured data. We only insert rows but never update or delete any rows
> in the CF.
>
> Data volume:
> The CF grows by about 0.5 million rows per day. We have a 4-node cluster
> and use the RandomPartitioner to spread the rows across the nodes.
>
> Requirements:
> There is a need to transfer the Cassandra data to a relational
> database periodically. Due to the large size of the CF, instead of
> truncating the relational table and reloading all rows into it each time,
> we plan to run a job to select the "delta" rows since the last run and
> insert them into the relational database.
>
> We would like to have some flexibility in how often the data transfer job
> is done. It may be run several times each day, or it may not be run at all
> on a given day.
>
> Options considered:
> - We are using RandomPartitioner, so range scan by row key is not feasible.
> - Add a secondary index on the timestamp column, but reading rows via
>   secondary index still requires an equality condition and does not support
>   range scan.
> - Add a secondary index on a column containing the date and hour of the
>   timestamp. Iterate over each hour between the time the job was last run
>   and now. Fetch all rows of each hour.
>
> I would appreciate any ideas for other design options for the Cassandra CF
> to enable extracting the rows efficiently.
>
> Besides Java, has anyone used any ETL tools to do this kind of delta
> extraction from Cassandra?
>
> Thanks,
> Chin
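
The wide-row idea in Andrey's reply can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not Cassandra client code: WideRowStore, insert, time_slice, and the "events" row key are hypothetical names, and timestamps are assumed to be integers (for example epoch milliseconds). It shows why putting the timestamp first in the column name makes a time-range read a plain slice over sorted columns, with no index involved.

```python
import bisect
from collections import defaultdict


class WideRowStore:
    """In-memory stand-in for a CF whose columns stay sorted by column name.

    Hypothetical model for illustration only; it does not talk to Cassandra.
    """

    def __init__(self):
        # row key -> sorted list of column names, each a (timestamp, user_id) tuple
        self._names = defaultdict(list)
        # row key -> column values, kept parallel to self._names
        self._values = defaultdict(list)

    def insert(self, row_key, timestamp, user_id, value):
        # The column name leads with the timestamp, so sorting by name
        # is sorting by time (ties broken by user_id).
        name = (timestamp, user_id)
        names = self._names[row_key]
        pos = bisect.bisect_right(names, name)
        names.insert(pos, name)
        self._values[row_key].insert(pos, value)

    def time_slice(self, row_key, start_ts, end_ts):
        """Return columns with start_ts <= timestamp < end_ts, in time order."""
        names = self._names[row_key]
        # A 1-tuple (ts,) sorts before every (ts, user_id), so bisect_left
        # finds the first column at or after the given timestamp.
        lo = bisect.bisect_left(names, (start_ts,))
        hi = bisect.bisect_left(names, (end_ts,))
        return list(zip(names[lo:hi], self._values[row_key][lo:hi]))


# Usage: a "delta" job reads everything from the last run up to now.
store = WideRowStore()
store.insert("events", 100, "alice", "login")
store.insert("events", 300, "bob", "click")
store.insert("events", 200, "alice", "view")

last_run, now = 150, 301
delta = store.time_slice("events", last_run, now)
```

Using a half-open interval [last_run, now) means consecutive job runs neither skip nor re-read an event, provided each run starts where the previous one ended.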
