cassandra-user mailing list archives

From DuyHai Doan <doanduy...@gmail.com>
Subject Re: Cluster sizing for huge dataset
Date Sun, 29 Sep 2019 07:29:57 GMT
Thank you Jeff for the hints

We are aiming for 20 TB per machine using TWCS and 8 vnodes (with
the new token allocation algorithm). We will also try the new zstd
compression.
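
As a rough illustration of what that density target means, the sizing arithmetic from the original message quoted below can be sketched in Python. The input figures come from that message; the function names and the decimal definition of a terabyte (10^12 bytes) are assumptions made for this sketch, and compression, compaction headroom and free-disk margins are deliberately ignored. It uses the exact 52,560 points per year rather than the rounded 50,000, so the totals come out slightly above the 900 TB quoted in the thread.

```python
# Back-of-envelope cluster sizing, using the figures from the thread.
SENSORS = 30_000_000        # number of sensors
POINTS_PER_HOUR = 6         # one data point every 10 minutes
BYTES_PER_POINT = 100       # estimated size, clustering columns included
RETENTION_YEARS = 2
REPLICATION_FACTOR = 3
TB = 10**12                 # assumption: decimal terabytes


def raw_bytes_per_year() -> int:
    """Unreplicated, uncompressed bytes written per year by all sensors."""
    points_per_sensor = POINTS_PER_HOUR * 24 * 365
    return points_per_sensor * BYTES_PER_POINT * SENSORS


def total_replicated_bytes() -> int:
    """Bytes on disk across the cluster for the full retention period."""
    return raw_bytes_per_year() * RETENTION_YEARS * REPLICATION_FACTOR


def nodes_needed(density_tb: int) -> int:
    """Nodes required at a given per-node density (rounded up)."""
    per_node = density_tb * TB
    return -(-total_replicated_bytes() // per_node)  # integer ceiling


if __name__ == "__main__":
    print(f"total: {total_replicated_bytes() / TB:.2f} TB")
    for density in (2, 10, 20):
        print(f"{density:>2} TB/node -> {nodes_needed(density)} nodes")
```

With these inputs the sketch gives 474 nodes at 2 TB, 95 at 10 TB and 48 at 20 TB per node, against 450, 90 and 45 when starting from the rounded 900 TB.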

About transient replication: the underlying trade-offs and semantics
are hard for most people to understand (for example, reading at CL
ONE after the loss of 2 full replicas leads to an unavailable
exception, unlike normal replication), so we will leave it out for the
moment.
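
For reference, the storage side of that trade-off can be sketched as follows. This is a simplified model, not a description of how Cassandra accounts for disk usage: it assumes a transient replica only holds data that has not yet been incrementally repaired, and `unrepaired_fraction` is a hypothetical knob standing in for how much of the dataset that is at any given time. The 300 TB raw figure is the unreplicated two-year total from the original message.

```python
def replicated_tb(raw_tb: float, full: int, transient: int = 0,
                  unrepaired_fraction: float = 0.0) -> float:
    """Total on-disk size for `full` full replicas plus `transient`
    transient replicas, each transient one holding only the unrepaired
    fraction of the data (simplifying assumption)."""
    return raw_tb * (full + transient * unrepaired_fraction)


def saving_vs_full(raw_tb: float, full: int, transient: int,
                   unrepaired_fraction: float) -> float:
    """Fraction of storage saved compared with making every replica full."""
    baseline = replicated_tb(raw_tb, full + transient)
    actual = replicated_tb(raw_tb, full, transient, unrepaired_fraction)
    return 1 - actual / baseline


if __name__ == "__main__":
    raw = 300.0  # TB before replication, from the thread
    for fraction in (0.0, 0.1):
        saved = saving_vs_full(raw, full=2, transient=1,
                               unrepaired_fraction=fraction)
        print(f"unrepaired={fraction:.0%}: saving {saved:.1%}")
```

Under this model the upper bound on the saving with 2 full and 1 transient replica is one third, and it shrinks as the unrepaired share grows, which is in line with the "may save you 30%" estimate quoted below.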

Regards

On Sun, Sep 29, 2019 at 3:50 AM Jeff Jirsa <jjirsa@gmail.com> wrote:
>
> A few random thoughts here
>
> 1) 90 nodes / 900T in a cluster isn’t that big. A petabyte per cluster is a manageable size.
>
> 2) The 2TB guidance is old and irrelevant for most people; what you really care about is how fast you can replace the failed machine
>
> You’d likely be ok going significantly larger than that if you use a few vnodes, since that’ll help rebuild faster (you’ll stream from more sources on rebuild)
>
> If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances on each one - it’s not hard to run 3-4TB per instance and 12-16T of SSD per machine
>
> 3) Transient replication in 4.0 could potentially be worth trying out, depending on your risk tolerance. Doing 2 full and one transient replica may save you 30% storage
>
> 4) Note that you’re not factoring in compression, and some of the recent zstd work may go a long way if your sensor data is similar / compressible.
>
> > On Sep 28, 2019, at 1:23 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:
> >
> > Hello users
> >
> > I'm facing a very challenging exercise: sizing a cluster for a huge dataset.
> >
> > Use-case = IoT
> >
> > Number of sensors: 30 million
> > Frequency of data: every 10 minutes
> > Estimated size of a data point: 100 bytes (including clustering columns)
> > Data retention: 2 years
> > Replication factor: 3 (pretty standard)
> >
> > Some quick math gives me:
> >
> > 6 data points / hour * 24 * 365 ~ 50 000 data points / year / sensor
> >
> > In terms of size, that is 50 000 x 100 bytes = 5 MB worth of data / year / sensor
> >
> > Now the big problem is that we have 30 million sensors, so the disk
> > requirements add up pretty fast: 5 MB * 30 000 000 = 5 TB * 30 = 150 TB
> > worth of data / year
> >
> > We want to store data for 2 years => 300 TB
> >
> > We have RF=3 ==> 900 TB !!!!
> >
> > Now, according to the commonly recommended density (with SSD), one should
> > not exceed 2 TB of data per node, which gives us a rough sizing of a
> > 450-node cluster !!!
> >
> > Even if we push the limit up to 10 TB using TWCS (has anyone tried
> > this?), we would still need 90 beefy nodes to support this.
> >
> > Any thoughts/ideas to reduce the node count or increase density and
> > keep the cluster manageable?
> >
> > Regards
> >
> > Duy Hai DOAN
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: user-help@cassandra.apache.org
> >
