Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: 
 <CAOxAL63nhspmxUc6A3mV1eSnQa5DUp3+maMFFdqN9LAKQ1-8Ag@mail.gmail.com>
References: 
 <CAAA+wSB5s9nz5is8tNMcErExddzx+91sWc2ZhsxVGLTYEHGD7w@mail.gmail.com>
	<CAOxAL63nhspmxUc6A3mV1eSnQa5DUp3+maMFFdqN9LAKQ1-8Ag@mail.gmail.com>
Date: Mon, 9 Nov 2015 09:02:34 -0500
Message-ID: 
 <CA+4UHyPab0pBNKJLet5HTbGa7HwbP2D1bMxzRc2Ewc34xBCOaw@mail.gmail.com>
Subject: Re: How to organize a timeseries by device?
From: Kai Wang <depend@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=001a11c15fd4893eb705241c0b44

--001a11c15fd4893eb705241c0b44
Content-Type: text/plain; charset=UTF-8

1. Don't let your partitions grow unbounded. It's tempting to just use
(device_id, timestamp) as the primary key, but sooner or later each
device's partition will become too large as data accumulates. You can keep
partitions bounded by adding a time bucket to the partition key:
((device_id, bucket), timestamp). Use hour, day, month or even year as the
bucket, like Jack mentioned, depending on the size of the data.

2. As for your specific query: for a given partition and time range, C*
doesn't need to load the whole partition and then filter it. It reads only
the slice within the time range from disk, because rows within a partition
are stored sorted by the timestamp clustering column.
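
To make both points concrete, here is a minimal Python sketch. It assumes a
daily bucket in UTC; the table and column names in the comment, and the
helper names day_bucket, buckets_for_range and slice_partition, are
hypothetical and not from this thread. slice_partition is only a toy model
of a partition sorted by its clustering column, not how Cassandra reads
data internally.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta, timezone

# Hypothetical schema (all names are assumptions for illustration):
#   CREATE TABLE events_by_device (
#       device_id text, bucket text, ts timestamp,
#       latitude double, longitude double,
#       PRIMARY KEY ((device_id, bucket), ts));


def day_bucket(ts):
    """Return the UTC day bucket (YYYY-MM-DD) for a timezone-aware timestamp."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d")


def buckets_for_range(start, end):
    """List every day bucket touched by the inclusive range [start, end]."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets


def slice_partition(sorted_ts, start, end):
    """Toy model of a clustered partition: because the timestamps are
    sorted, the range is located by binary search instead of a full scan."""
    lo = bisect_left(sorted_ts, start)
    hi = bisect_right(sorted_ts, end)
    return sorted_ts[lo:hi]
```

A writer would store each event under (device_id, day_bucket(ts)). A reader
asking for one device's last week would issue one range query per bucket
returned by buckets_for_range, each restricted to the requested time range.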

On Mon, Nov 9, 2015 at 8:13 AM, Jack Krupansky <jack.krupansky@gmail.com>
wrote:

> The general rule in Cassandra data modeling is to look at all of your
> queries first and then to declare a table for each query, even if that
> means storing multiple copies of the data. So, create a second table with
> bucketed time as the partition key (hour, 15 minutes, or whatever time
> interval makes sense to give 1 to 10 megabytes per partition) and time and
> device as the clustering keys.
>
> Or, consider DSE Search, and then you can do whatever ad hoc queries you
> want using Solr. Or Stratio or TupleJump Stargate for an open source Lucene
> plugin.
>
> -- Jack Krupansky
>
> On Mon, Nov 9, 2015 at 8:05 AM, Guillaume Charhon <
> guillaume@databerries.com> wrote:
>
>> Hello,
>>
>> We are currently storing geolocation events (about one every 5 minutes)
>> for each device we track. We currently have 2 TB of data. I would like to
>> store the device_id, the timestamp of the event, latitude and longitude. I
>> thought about using the device_id as the partition key and timestamp as
>> the clustering column. It is great as events are naturally grouped by
>> device (very useful for our Spark jobs). However, if I would like to
>> retrieve all events of all devices for the last week, I understood that
>> Cassandra will need to load all the data and filter it, which does not
>> seem clean in the long term.
>>
>> How should I create my model?
>>
>> Best Regards
>>
>
>


--001a11c15fd4893eb705241c0b44--