From: Haris Altaf
Date: Tue, 29 May 2018 22:24:19 +0500
Subject: Re: Time Series schema performance
To: user@cassandra.apache.org
Hi All,
I have a related question. How do you down-sample your time series data?

regards,
Haris
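(For illustration only: one common approach is a separate rollup table at a
coarser granularity, populated by an external aggregation job. This is a
minimal CQL sketch, not something from the thread; the table name, column
names, and hourly granularity are all hypothetical.)

-- Hypothetical hourly rollup of the raw 5-minute points. An external job
-- (e.g. a scheduled Spark or application-level task) reads the raw table,
-- aggregates each hour, and writes one row per id per hour.
CREATE TABLE metrics_rollup_1h (
    id      timeuuid,
    hour    timestamp,   -- raw timestamp truncated to the hour
    metric1 bigint,      -- e.g. the average of the 12 five-minute samples
    metric2 bigint,
    PRIMARY KEY (id, hour)
) WITH CLUSTERING ORDER BY (hour DESC);

-- The job then writes the aggregate, for example:
INSERT INTO metrics_rollup_1h (id, hour, metric1, metric2)
VALUES (123e4567-e89b-12d3-a456-426655440000, '2018-05-29 17:00:00+0000', 42, 17);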
On Tue, 29 May 2018 at 22:11 Jonathan Haddad <jon@jonhaddad.com> wrote:

> I wrote a post on this topic a while ago, might be worth reading over:
>
> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>
> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jjirsa@gmail.com> wrote:
>
> > There's a third option, which is doing bucketing by time instead of by
> > hash. That tends to perform quite well if you're using TWCS, as it
> > makes it quite likely that a read can be served by a single SSTable.
> >
> > --
> > Jeff Jirsa
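(A minimal CQL sketch of the time-bucketed layout Jeff describes, assuming a
daily bucket and a one-day TWCS window; the table name, column names, and
window size are hypothetical.)

-- Bucket by day rather than by a hash of the id. With TWCS, SSTables are
-- compacted per time window, so a read for one day's partition is likely
-- to be served by a single SSTable.
CREATE TABLE metrics_by_day (
    id      timeuuid,
    day     date,        -- the time bucket, part of the partition key
    ts      timestamp,
    metric1 bigint,
    PRIMARY KEY ((id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1};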
> >
> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet.jog@gmail.com> wrote:
> >
> > > Folks,
> > > I have two alternatives for the time series schema I have, and wanted
> > > to weigh in on one of them.
> > >
> > > The query is: given an id and a timestamp, read the metrics associated
> > > with that id.
> > >
> > > The records are inserted every 5 minutes, and the number of ids = 2
> > > million, so every 5 minutes 2 million records will be written.
> > >
> > > Bucket range: 0 - 5K.
> > >
> > > Schema 1)
> > >
> > > create table (
> > >     id timeuuid,
> > >     bucketid int,
> > >     date date,
> > >     timestamp timestamp,
> > >     metricName1 bigint,
> > >     metricName2 bigint,
> > >     ...
> > >     metricName300 bigint,
> > >     PRIMARY KEY ((date, bucketid), id, timestamp)
> > > )
> > >
> > > bucketid is just a murmur3 hash of the id, which acts as a splitter to
> > > group ids in a partition.
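(For reference, a runnable version of Schema 1 as sketched above: the table
name is hypothetical, only three of the 300 metric columns are shown, and
bucketid is computed client-side, presumably murmur3(id) mod 5000 given the
stated 0 - 5K bucket range.)

CREATE TABLE metrics_schema1 (          -- hypothetical table name
    id            timeuuid,
    bucketid      int,                  -- client-side: murmur3(id) % 5000
    date          date,
    ts            timestamp,
    metricName1   bigint,
    metricName2   bigint,
    metricName300 bigint,               -- columns 3-299 elided
    PRIMARY KEY ((date, bucketid), id, ts)
);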
> > >
> > > Pros:
> > > Efficient write performance, since data is written to a minimal number
> > > of partitions.
> > >
> > > Cons:
> > > While this schema works best when queried programmatically, it is a
> > > bit inflexible if it has to be integrated with 3rd-party BI tools like
> > > Tableau: the bucketid cannot be generated from Tableau, as it's not
> > > part of the view, etc.
> > >
> > > Schema 2)
> > > Same as above, without bucketid and date.
> > >
> > > PRIMARY KEY (id, timestamp)
> > >
> > > Pros:
> > > BI tools don't need to generate bucket-id lookups.
> > >
> > > Cons:
> > > Too many partitions are written every 5 minutes; say, 2 million
> > > records written to 2 million distinct partitions.
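(The read-side difference between the two schemas, sketched as CQL against
the hypothetical tables above; Schema 2 is assumed to be the same table with
PRIMARY KEY (id, ts). This illustrates the BI-tool con: Schema 1 requires
the caller to supply both the date and the computed bucketid before it can
address a partition, which a tool like Tableau cannot derive on its own.)

-- Schema 1: caller must know the date and compute bucketid = murmur3(id) % 5000.
SELECT * FROM metrics_schema1
WHERE date = '2018-05-29'
  AND bucketid = 1234
  AND id = 123e4567-e89b-12d3-a456-426655440000
  AND ts >= '2018-05-29 00:00:00+0000' AND ts < '2018-05-30 00:00:00+0000';

-- Schema 2: the id alone addresses the partition.
SELECT * FROM metrics_schema2
WHERE id = 123e4567-e89b-12d3-a456-426655440000
  AND ts >= '2018-05-29 00:00:00+0000' AND ts < '2018-05-30 00:00:00+0000';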
> > >
> > > I believe writing this data to the commit log is the same for Schema 1
> > > and Schema 2, but the actual performance bottleneck could be
> > > compaction, since data from the memtable is flushed to SSTables
> > > regularly, based on the memory settings, and every SSTable maintains a
> > > partition index with byte offsets.
> > >
> > > I wanted to gauge how bad the performance of Schema 2 can get with
> > > respect to writes and compaction having to do many disk seeks.
> > >
> > > Compacting many SSTables, each with a very large number of
> > > partition-index entries because of the high number of partitions: can
> > > this be a bottleneck?
> > >
> > > Any in-depth performance explanation of Schema 2 would be very
> > > helpful.
> > >
> > > Thanks,
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade

--
regards,
Haris