From: Guillaume Charhon <guillaume@databerries.com>
Date: Mon, 9 Nov 2015 16:53:15 +0100
Subject: Re: How to organize a timeseries by device?
To: user@cassandra.apache.org

For the first table, (device_id, timestamp): should I add a bucket even if I know I might have millions of events per device, but never billions?

On Mon, Nov 9, 2015 at 4:37 PM, Jack Krupansky <jack.krupansky@gmail.com> wrote:

Cassandra is good at two kinds of queries: 1) accessing a specific row by a specific key, and 2) accessing a slice, or consecutive sequence of rows, within a given partition.

It is recommended to avoid ALLOW FILTERING. If it happens to work well for you, great, go for it, but if it doesn't, then simply don't do it. It is best to redesign your data model to play to Cassandra's strengths.

If you bucket the time-based table, do a separate query for each time bucket.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 10:16 AM, Guillaume Charhon <guillaume@databerries.com> wrote:

Kai, Jack,

On 1., should the bucket be a STRING with a date format, or do I have a better option? For (device_id, bucket, timestamp), did you mean ((device_id, bucket), timestamp)?

On 2., what are the risks of timeout? I currently have this warning: "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING".

On Mon, Nov 9, 2015 at 3:02 PM, Kai Wang <depend@gmail.com> wrote:

1. Don't make your partitions unbounded. It's tempting to just use (device_id, timestamp), but sooner or later you will have problems as time goes by. You can keep the partitions bounded by using (device_id, bucket, timestamp). Use an hour, a day, a month, or even a year, as Jack mentioned, depending on the size of the data.

2. As to your specific query: for a given partition and a time range, C* doesn't need to load the whole partition and then filter. It retrieves only the slice within the time range from disk, because the data is clustered by timestamp.
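A minimal CQL sketch of the bucketed model Kai describes; the table name, the text month bucket, and the column types are illustrative assumptions, not from the thread:

    CREATE TABLE events_by_device (
        device_id   uuid,
        bucket      text,       -- e.g. '2015-11': one month per partition; pick the
                                -- interval that keeps partitions at a sane size
        event_time  timestamp,
        latitude    double,
        longitude   double,
        PRIMARY KEY ((device_id, bucket), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC);

    -- A time-range query touches one partition and reads only the matching
    -- clustering slice; no ALLOW FILTERING is needed.
    SELECT * FROM events_by_device
    WHERE device_id = 123e4567-e89b-12d3-a456-426655440000
      AND bucket = '2015-11'
      AND event_time >= '2015-11-02' AND event_time < '2015-11-09';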
On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupansky@gmail.com> wrote:

The general rule in Cassandra data modeling is to look at all of your queries first and then to declare a table for each query, even if that means storing multiple copies of the data. So, create a second table with bucketed time as the partition key (an hour, 15 minutes, or whatever time interval makes sense to give 1 to 10 megabytes per partition) and time and device as the clustering keys.

Or, consider DSE Search, and then you can do whatever ad hoc queries you want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene plugin.

-- Jack Krupansky

On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <guillaume@databerries.com> wrote:

Hello,

We are currently storing geolocation events (about 1 per 5 minutes) for each device we track, and we currently have 2 TB of data. I would like to store the device_id, the timestamp of the event, the latitude, and the longitude. I thought about using the device_id as the partition key and the timestamp as the clustering column. That is great, as events are naturally grouped by device (very useful for our Spark jobs). However, if I want to retrieve all events of all devices from the last week, I understand that Cassandra would need to load all the data and filter it, which does not seem sustainable in the long term.

How should I create my model?

Best Regards
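A hedged sketch of the second, time-partitioned table Jack recommends for the "all devices, last week" query; the day-sized bucket and all names are assumptions, and the interval should be tuned toward the 1 to 10 megabytes per partition he mentions:

    CREATE TABLE events_by_time (
        time_bucket text,       -- one partition per day here; shrink to hours or
                                -- 15 minutes if a day's events exceed a few MB
        event_time  timestamp,
        device_id   uuid,
        latitude    double,
        longitude   double,
        PRIMARY KEY (time_bucket, event_time, device_id)
    ) WITH CLUSTERING ORDER BY (event_time DESC, device_id ASC);

    -- "All devices over the last week" becomes seven single-partition queries,
    -- one per bucket, instead of a filtered scan of the whole cluster.
    SELECT * FROM events_by_time WHERE time_bucket = '2015-11-09';
    SELECT * FROM events_by_time WHERE time_bucket = '2015-11-08';
    -- ...and so on for the remaining five days.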