Return-Path: X-Original-To: apmail-cassandra-user-archive@www.apache.org Delivered-To: apmail-cassandra-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 770BB1854B for ; Mon, 9 Nov 2015 14:02:45 +0000 (UTC) Received: (qmail 40655 invoked by uid 500); 9 Nov 2015 14:02:43 -0000 Delivered-To: apmail-cassandra-user-archive@cassandra.apache.org Received: (qmail 40624 invoked by uid 500); 9 Nov 2015 14:02:43 -0000 Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@cassandra.apache.org Delivered-To: mailing list user@cassandra.apache.org Received: (qmail 40614 invoked by uid 99); 9 Nov 2015 14:02:42 -0000 Received: from Unknown (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 09 Nov 2015 14:02:42 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id 75E80C070B for ; Mon, 9 Nov 2015 14:02:42 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.899 X-Spam-Level: ** X-Spam-Status: No, score=2.899 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, HTML_MESSAGE=3, RCVD_IN_MSPIKE_H2=-0.001, SPF_PASS=-0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=gmail.com Received: from mx1-eu-west.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id kZGNCJ-8F5rh for ; Mon, 9 Nov 2015 14:02:36 +0000 (UTC) Received: from mail-qg0-f48.google.com (mail-qg0-f48.google.com [209.85.192.48]) by mx1-eu-west.apache.org (ASF Mail Server at mx1-eu-west.apache.org) with ESMTPS id AE59920C6E for ; Mon, 9 Nov 2015 14:02:35 +0000 (UTC) Received: by qgeb1 with SMTP id b1so90321695qge.1 for ; Mon, 09 Nov 2015 06:02:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=a1NL3ATAWXjAd5GO8lM0NCokn1+/cbiU0X60r9raEZI=; b=QVkj3vvDR9t50aUTyawERZVTLw1yuYn9R3gwG3CFLM0EQdHNhw/oWfGa78ydTifDh5 JfqsvD++rBVaowt6wBaX3UK57YILHymemjffsjtnd3/UhwFCQweyKeVoSceH2nZXQ5ib eps6hLH+PDmwgTJMaw2LSEltqplzOI+mm8SrcxB7Zuxl0HuwBK+Ttive+AQpr34jgaHw i4FlB+Zs5YTJ9+Rjj9jsTswyZuF8WEf2zHjb1AMcRagBUdXHrAwJ4ljOwRj9B2/NWopS dtYdVfWWxA4hFwegNZfVsSJXw/FGWGCL7QvWnFuG42Y8Du/qwVa7GuhHgGejHDqsgrgF JYzA== MIME-Version: 1.0 X-Received: by 10.140.102.201 with SMTP id w67mr29795595qge.82.1447077754779; Mon, 09 Nov 2015 06:02:34 -0800 (PST) Received: by 10.55.106.198 with HTTP; Mon, 9 Nov 2015 06:02:34 -0800 (PST) In-Reply-To: References: Date: Mon, 9 Nov 2015 09:02:34 -0500 Message-ID: Subject: Re: How to organize a timeseries by device? From: Kai Wang To: user@cassandra.apache.org Content-Type: multipart/alternative; boundary=001a11c15fd4893eb705241c0b44 --001a11c15fd4893eb705241c0b44 Content-Type: text/plain; charset=UTF-8 1. Don't make your partition unbound. It's tempting to just use (device_id, timestamp). But soon or later you will have problem when time goes by. You can keep the partition bound by using (device_id, bucket, timestamp). Use hour, day, month or even year like Jack mentioned depending on the size of data. 2. As to your specific query, for a given partition and a time range, C* doesn't need to load the whole partition then filter. It only retrieves the slice within the time range from disk because the data is clustered by timestamp. On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky wrote: > The general rule in Cassandra data modeling is to look at all of your > queries first and then to declare a table for each query, even if that > means storing multiple copies of the data. So, create a second table with > bucketed time as the partition key (hour, 15 minutes, or whatever time > interval makes sense to give 1 to 10 megabytes per partition) and time and > device as the clustering keys. > > Or, consider DSE SEarch and then you can do whatever ad hoc queries you > want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene > plugin. > > -- Jack Krupansky > > On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon < > guillaume@databerries.com> wrote: > >> Hello, >> >> We are currently storing geolocation events (about 1 per 5 minutes) for >> each device we track. We currently have 2 TB of data. I would like to store >> the device_id, the timestamp of the event, latitude and longitude. I though >> about using the device_id as the partition key and timestamp as the >> clustering column. It is great as events are naturally grouped by device >> (very useful for our Spark jobs). However, if I would like to retrieve all >> events of all devices of the last week I understood that Cassandra will >> need to load all data and filter which does not seems to be clean on the >> long term. >> >> How should I create my model? >> >> Best Regards >> > > --001a11c15fd4893eb705241c0b44 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
1. Don't make your partition unbound. It's te= mpting to just use (device_id, timestamp). But soon or later you will have = problem when time goes by. You can keep the partition bound by using (devic= e_id, bucket, timestamp). Use hour, day, month or even year like Jack menti= oned depending on the size of data.

2. As to your specific que= ry, for a given partition and a time range, C* doesn't need to load the= whole partition then filter. It only retrieves the slice within the time r= ange from disk because the data is clustered by timestamp.

On Mon, Nov 9, 2015 at 8= :13 AM, Jack Krupansky <jack.krupansky@gmail.com> wro= te:
The general rule in = Cassandra data modeling is to look at all of your queries first and then to= declare a table for each query, even if that means storing multiple copies= of the data. So, create a second table with bucketed time as the partition= key (hour, 15 minutes, or whatever time interval makes sense to give 1 to = 10 megabytes per partition) and time and device as the clustering keys.
Or, consider DSE SEarch =C2=A0and then you can do whatever = ad hoc queries you want using Solr. Or Stratio or TupleJump Stargate for an= open source Lucene plugin.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Ch= arhon <guillaume@databerries.com> wrote:
Hello,=C2=A0

We are currently storing geolocation events (a= bout 1 per 5 minutes) for each device we track. We currently have 2 TB of d= ata. I would like to store the device_id, the timestamp of the event, latit= ude and longitude. I though about using the device_id as the partition key = and timestamp as the clustering column. It is great as events are naturally= grouped by device (very useful for our Spark jobs). However, if I would li= ke to retrieve all events of all devices of the last week I understood that= Cassandra will need to load all data and filter which does not seems to be= clean on the long term.=C2=A0

How should I = create my model?=C2=A0

Best Regards


--001a11c15fd4893eb705241c0b44--