From: aaron morton
Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data
Date: Wed, 12 Dec 2012 09:48:42 +1300
To: user@cassandra.apache.org

A couple of ideas. One is to multiplex the event log stream (using Flume or Kafka) and feed it straight into your secondary system. The event system should allow you to rate limit inserts if that is a concern.

The other is to use partitioning.

Group the log entries per user into some sensible partition, e.g. per day or per week, so your row key is "user_id : partition_start".

You can then keep a record of dirty partitions; this can be tricky depending on scale. It could be a row for each user, with a column for each dirty partition. Loading the delta then requires a range scan over the dirty partitions CF to read all rows, and then reading each dirty partition for the user. You would want to look at a low GC Grace (gc_grace_seconds) and LDB (leveled compaction) for the dirty partitions CF.

Hope that helps.

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 12/12/2012, at 7:20 AM, "Hiller, Dean" wrote:

> Wide rows do not work well once you start getting past 10,000,000 columns, though, so be very careful there. PlayOrm does some wide row indices for us, and each row length is as large as the number of rows in a partition, so without PlayOrm you could do partitioning yourself, by the way... It's as simple as storing every row and adding it to the partition's index.
>
> Later,
> Dean
>
>
> From: Andrey Ilinykh
> Reply-To: "user@cassandra.apache.org"
> Date: Tuesday, December 11, 2012 10:45 AM
> To: "user@cassandra.apache.org"
> Subject: Re: Selecting rows efficiently from a Cassandra CF containing time series data
>
> I would consider using wide rows. If you add a timestamp to your column name, you have naturally sorted data. You can easily select any time range without any indexes.

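To make the partitioning idea a bit more concrete, here is a rough sketch of the "user_id : partition_start" row key, the dirty-partition record, and a time-range read over timestamp-sorted columns. It is only an in-memory illustration, not Cassandra client code: the `LogStore` class, its method names, and the one-day partition size are all assumptions made for the example, and plain dicts stand in for the two column families.

```python
from bisect import bisect_left, insort
from collections import defaultdict

DAY_SECONDS = 86400  # assumed partition size: one day


def partition_start(ts, size=DAY_SECONDS):
    """Round a Unix timestamp down to the start of its partition."""
    return ts - ts % size


def row_key(user_id, ts):
    """Build the 'user_id : partition_start' row key."""
    return f"{user_id}:{partition_start(ts)}"


class LogStore:
    """Hypothetical in-memory stand-in for the two column families."""

    def __init__(self):
        # Log CF: row key -> list of (timestamp, event) columns, kept sorted.
        self.log = defaultdict(list)
        # Dirty partitions CF: one row per user, one column per dirty partition.
        self.dirty = defaultdict(set)

    def insert(self, user_id, ts, event):
        # Columns stay ordered by timestamp, mimicking sorted column names.
        insort(self.log[row_key(user_id, ts)], (ts, event))
        self.dirty[user_id].add(partition_start(ts))

    def load_delta(self):
        """Scan all dirty rows, read each dirty partition, clear the markers."""
        delta = {}
        for user_id, starts in self.dirty.items():
            for start in sorted(starts):
                key = f"{user_id}:{start}"
                delta[key] = list(self.log.get(key, []))
        self.dirty.clear()
        return delta

    def time_range(self, user_id, start_ts, end_ts):
        """Return events with start_ts <= timestamp <= end_ts, no index needed."""
        result = []
        start = partition_start(start_ts)
        while start <= end_ts:
            columns = self.log.get(f"{user_id}:{start}", [])
            lo = bisect_left(columns, (start_ts,))
            hi = bisect_left(columns, (end_ts + 1,))
            result.extend(columns[lo:hi])
            start += DAY_SECONDS
        return result


store = LogStore()
store.insert("u1", 100, "login")
store.insert("u1", 50, "signup")
store.insert("u1", DAY_SECONDS + 5, "logout")
print(store.time_range("u1", 60, DAY_SECONDS + 10))
print(store.load_delta())
```

In a real cluster the dirty markers would be deleted rather than cleared in memory, which is why a low GC Grace matters for that CF: the tombstones would otherwise pile up.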