From: Haris Altaf
Date: Tue, 29 May 2018 22:24:19 +0500
Subject: Re: Time Series schema performance
To: user@cassandra.apache.org
Hi All,
I have a related question. How do you down-sample your time series data?

regards,
Haris
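(For illustration only: one common approach is a separate rollup table at a
coarser granularity, populated by an external aggregation job. This is a
minimal CQL sketch, not something from the thread; the table name, column
names, and hourly granularity are all hypothetical.)

-- Hypothetical hourly rollup of the raw 5-minute points. An external job
-- (e.g. a scheduled Spark or application-level task) reads the raw table,
-- aggregates each hour, and writes one row per id per hour.
CREATE TABLE metrics_rollup_1h (
    id      timeuuid,
    hour    timestamp,   -- raw timestamp truncated to the hour
    metric1 bigint,      -- e.g. the average of the 12 five-minute samples
    metric2 bigint,
    PRIMARY KEY (id, hour)
) WITH CLUSTERING ORDER BY (hour DESC);

-- The job then writes the aggregate, for example:
INSERT INTO metrics_rollup_1h (id, hour, metric1, metric2)
VALUES (123e4567-e89b-12d3-a456-426655440000, '2018-05-29 17:00:00+0000', 42, 17);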
On Tue, 29 May 2018 at 22:11 Jonathan Haddad <jon@jonhaddad.com> wrote:

> I wrote a post on this topic a while ago, might be worth reading over:
>
> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>
> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jjirsa@gmail.com> wrote:
>
> > There's a third option, which is doing bucketing by time instead of by
> > hash. That tends to perform quite well if you're using TWCS, as it
> > makes it quite likely that a read can be served by a single SSTable.
> >
> > --
> > Jeff Jirsa
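(A minimal CQL sketch of the time-bucketed layout Jeff describes, assuming a
daily bucket and a one-day TWCS window; the table name, column names, and
window size are hypothetical.)

-- Bucket by day rather than by a hash of the id. With TWCS, SSTables are
-- compacted per time window, so a read for one day's partition is likely
-- to be served by a single SSTable.
CREATE TABLE metrics_by_day (
    id      timeuuid,
    day     date,        -- the time bucket, part of the partition key
    ts      timestamp,
    metric1 bigint,
    PRIMARY KEY ((id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1};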
> >
> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet.jog@gmail.com> wrote:
> >
> > > Folks,
> > > I have two alternatives for the time series schema I have, and wanted
> > > to weigh in on one of them.
> > >
> > > The query is: given an id and a timestamp, read the metrics associated
> > > with that id.
> > >
> > > The records are inserted every 5 minutes, and the number of ids = 2
> > > million, so every 5 minutes 2 million records will be written.
> > >
> > > Bucket range: 0 - 5K.
> > >
> > > Schema 1)
> > >
> > > create table (
> > >     id timeuuid,
> > >     bucketid int,
> > >     date date,
> > >     timestamp timestamp,
> > >     metricName1 bigint,
> > >     metricName2 bigint,
> > >     ...
> > >     metricName300 bigint,
> > >     PRIMARY KEY ((date, bucketid), id, timestamp)
> > > )
> > >
> > > bucketid is just a murmur3 hash of the id, which acts as a splitter to
> > > group ids in a partition.
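(For reference, a runnable version of Schema 1 as sketched above: the table
name is hypothetical, only three of the 300 metric columns are shown, and
bucketid is computed client-side, presumably murmur3(id) mod 5000 given the
stated 0 - 5K bucket range.)

CREATE TABLE metrics_schema1 (          -- hypothetical table name
    id            timeuuid,
    bucketid      int,                  -- client-side: murmur3(id) % 5000
    date          date,
    ts            timestamp,
    metricName1   bigint,
    metricName2   bigint,
    metricName300 bigint,               -- columns 3-299 elided
    PRIMARY KEY ((date, bucketid), id, ts)
);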
> > >
> > > Pros:
> > > Efficient write performance, since data is written to a minimal number
> > > of partitions.
> > >
> > > Cons:
> > > While this schema works best when queried programmatically, it is a
> > > bit inflexible if it has to be integrated with 3rd-party BI tools like
> > > Tableau: the bucketid cannot be generated from Tableau, as it's not
> > > part of the view, etc.
> > >
> > > Schema 2)
> > > Same as above, without bucketid and date.
> > >
> > > PRIMARY KEY (id, timestamp)
> > >
> > > Pros:
> > > BI tools don't need to generate bucket-id lookups.
> > >
> > > Cons:
> > > Too many partitions are written every 5 minutes; say, 2 million
> > > records written to 2 million distinct partitions.
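(The read-side difference between the two schemas, sketched as CQL against
the hypothetical tables above; Schema 2 is assumed to be the same table with
PRIMARY KEY (id, ts). This illustrates the BI-tool con: Schema 1 requires
the caller to supply both the date and the computed bucketid before it can
address a partition, which a tool like Tableau cannot derive on its own.)

-- Schema 1: caller must know the date and compute bucketid = murmur3(id) % 5000.
SELECT * FROM metrics_schema1
WHERE date = '2018-05-29'
  AND bucketid = 1234
  AND id = 123e4567-e89b-12d3-a456-426655440000
  AND ts >= '2018-05-29 00:00:00+0000' AND ts < '2018-05-30 00:00:00+0000';

-- Schema 2: the id alone addresses the partition.
SELECT * FROM metrics_schema2
WHERE id = 123e4567-e89b-12d3-a456-426655440000
  AND ts >= '2018-05-29 00:00:00+0000' AND ts < '2018-05-30 00:00:00+0000';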
> > >
> > > I believe writing this data to the commit log is the same for Schema 1
> > > and Schema 2, but the actual performance bottleneck could be
> > > compaction, since data from the memtable is flushed to SSTables
> > > regularly, based on the memory settings, and every SSTable maintains a
> > > partition index with byte offsets.
> > >
> > > I wanted to gauge how bad the performance of Schema 2 can get with
> > > respect to writes and compaction having to do many disk seeks.
> > >
> > > Compacting many SSTables, each with a very large number of
> > > partition-index entries because of the high number of partitions: can
> > > this be a bottleneck?
> > >
> > > Any in-depth performance explanation of Schema 2 would be very
> > > helpful.
> > >
> > > Thanks,
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade

--
regards,
Haris