From: Guillaume Charhon <guillaume@databerries.com>
Date: Mon, 9 Nov 2015 16:53:15 +0100
Subject: Re: How to organize a timeseries by device?
To: user@cassandra.apache.org

For the first table, (device_id, timestamp): should I add a bucket even if I know I might have millions of events per device, but never billions?

On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupansky@gmail.com> wrote:

Cassandra is good at two kinds of queries: 1) accessing a specific row by a specific key, and 2) accessing a slice, or consecutive sequence of rows, within a given partition.

It is recommended to avoid ALLOW FILTERING. If it happens to work well for you, great, go for it, but if it doesn't, then simply don't do it. It is best to redesign your data model to play to Cassandra's strengths.

If you bucket the time-based table, do a separate query for each time bucket.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <guillaume@databerries.com> wrote:

Kai, Jack,

On 1., should the bucket be a STRING with a date format, or do I have a better option? For (device_id, bucket, timestamp), did you mean ((device_id, bucket), timestamp)?

On 2., what are the risks of timeout? I currently have this warning: "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".

On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <depend@gmail.com> wrote:

1. Don't make your partitions unbounded. It's tempting to just use (device_id, timestamp), but sooner or later you will have problems as time goes by. You can keep the partitions bounded by using (device_id, bucket, timestamp). Use an hour, a day, a month, or even a year, as Jack mentioned, depending on the size of the data.

2. As to your specific query: for a given partition and a time range, C* doesn't need to load the whole partition and then filter. It retrieves only the slice within the time range from disk, because the data is clustered by timestamp.
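A minimal CQL sketch of the bucketed model Kai describes; the table name, the text month bucket, and the column types are illustrative assumptions, not from the thread:

    CREATE TABLE events_by_device (
        device_id   uuid,
        bucket      text,       -- e.g. '2015-11': one month per partition; pick the
                                -- interval that keeps partitions at a sane size
        event_time  timestamp,
        latitude    double,
        longitude   double,
        PRIMARY KEY ((device_id, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

    -- A time-range query touches one partition and reads only the matching
    -- clustering slice; no ALLOW FILTERING is needed.
    SELECT * FROM events_by_device
    WHERE device_id = 123e4567-e89b-12d3-a456-426655440000
      AND bucket = '2015-11'
      AND event_time >= '2015-11-02' AND event_time < '2015-11-09';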
On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupansky@gmail.com> wrote:

The general rule in Cassandra data modeling is to look at all of your queries first and then to declare a table for each query, even if that means storing multiple copies of the data. So, create a second table with bucketed time as the partition key (an hour, 15 minutes, or whatever time interval makes sense to give 1 to 10 megabytes per partition) and time and device as the clustering keys.

Or, consider DSE Search, and then you can do whatever ad hoc queries you want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene plugin.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <guillaume@databerries.com> wrote:

Hello,

We are currently storing geolocation events (about 1 per 5 minutes) for each device we track, and we currently have 2 TB of data. I would like to store the device_id, the timestamp of the event, the latitude, and the longitude. I thought about using the device_id as the partition key and the timestamp as the clustering column. That is great, as events are naturally grouped by device (very useful for our Spark jobs). However, if I want to retrieve all events of all devices from the last week, I understand that Cassandra would need to load all the data and filter it, which does not seem sustainable in the long term.

How should I create my model?

Best Regards
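A hedged sketch of the second, time-partitioned table Jack recommends for the "all devices, last week" query; the day-sized bucket and all names are assumptions, and the interval should be tuned toward the 1 to 10 megabytes per partition he mentions:

    CREATE TABLE events_by_time (
        time_bucket text,       -- one partition per day here; shrink to hours or
                                -- 15 minutes if a day's events exceed a few MB
        event_time  timestamp,
        device_id   uuid,
        latitude    double,
        longitude   double,
        PRIMARY KEY (time_bucket, event_time, device_id)
    ) WITH CLUSTERING ORDER BY (event_time DESC, device_id ASC);

    -- "All devices over the last week" becomes seven single-partition queries,
    -- one per bucket, instead of a filtered scan of the whole cluster.
    SELECT * FROM events_by_time WHERE time_bucket = '2015-11-09';
    SELECT * FROM events_by_time WHERE time_bucket = '2015-11-08';
    -- ...and so on for the remaining five days.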