cassandra-user mailing list archives

From Jeff Jirsa <jji...@gmail.com>
Subject Re: Cluster sizing for huge dataset
Date Sun, 29 Sep 2019 01:49:54 GMT
A few random thoughts here

1) 90 nodes / 900 TB in a cluster isn’t that big. A petabyte per cluster is a manageable size.


2) The 2 TB guidance is old and irrelevant for most people; what you really care about is how
fast you can replace a failed machine.

You’d likely be OK going significantly larger than that if you use a few vnodes, since that’ll
help you rebuild faster (you’ll stream from more sources on rebuild).

If you don’t want to use vnodes, buy big machines and run multiple Cassandra instances on
each - it’s not hard to run 3-4 TB per instance with 12-16 TB of SSD per machine.
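To make the replacement-time argument concrete, here's a back-of-envelope sketch (not from the original post; the 1 Gbps per-source streaming throughput is an illustrative assumption, not a Cassandra default):

```python
def rebuild_hours(data_tb, sources, gbps_per_source=1.0):
    """Hours to restream a dead node's data, assuming each streaming
    source sustains gbps_per_source and the streams run in parallel."""
    total_bits = data_tb * 1e12 * 8
    return total_bits / (sources * gbps_per_source * 1e9) / 3600

# Single token range (one streaming source) vs a few vnodes (several sources):
print(round(rebuild_hours(2, sources=1), 1))   # 2 TB from one peer  -> 4.4
print(round(rebuild_hours(10, sources=4), 1))  # 10 TB from four peers -> 5.6
```

The point being made: with enough parallel streaming sources, a 10 TB node can come back in roughly the same wall-clock time as a 2 TB node streaming from one peer.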

3) Transient replication in 4.0 could be worth trying out, depending on your risk
tolerance. Running two full replicas and one transient replica may save you ~30% of storage.
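The ~30% figure follows from simple replica counting. A quick sketch (steady state only; transient replicas do hold data temporarily between incremental repairs, so real savings come in somewhat lower):

```python
rf = 3             # keyspace replication factor
full_replicas = 2  # replicas that keep the data permanently
# At steady state a transient replica holds no data (incremental repair
# purges it), so storage scales with the full replicas only:
steady_state_savings = 1 - full_replicas / rf
print(f"{steady_state_savings:.0%}")  # -> 33%
```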

4) Note that you’re not factoring in compression, and some of the recent zstd work may go
a long way if your sensor data is similar / compressible.
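Pulling the thread's numbers together: this sketch just reproduces the arithmetic from the quoted post, then layers on an assumed compression ratio and per-node density (both illustrative figures, not measurements):

```python
sensors = 30_000_000
points_per_year = 6 * 24 * 365   # one data point every 10 minutes
bytes_per_point = 100            # including clustering columns
years, rf = 2, 3

raw_tb = sensors * points_per_year * bytes_per_point * years * rf / 1e12
print(round(raw_tb))             # -> 946, i.e. the thread's ~900 TB

# Hypothetical knobs: a 3:1 compression ratio (e.g. zstd on repetitive
# sensor data) and 10 TB/node via TWCS.
compression_ratio = 3.0
tb_per_node = 10
nodes = raw_tb / compression_ratio / tb_per_node
print(round(nodes))              # -> 32
```

Under those (optimistic) assumptions the 90-node estimate drops substantially, which is why the compression point matters as much as the density one.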

> On Sep 28, 2019, at 1:23 PM, DuyHai Doan <doanduyhai@gmail.com> wrote:
> 
> Hello users
> 
> I'm facing with a very challenging exercise: size a cluster with a huge dataset.
> 
> Use-case = IoT
> 
> Number of sensors: 30 millions
> Frequency of data: every 10 minutes
> Estimate size of a data: 100 bytes (including clustering columns)
> Data retention: 2 years
> Replication factor: 3 (pretty standard)
> 
> A very quick math gives me:
> 
> 6 data points / hour * 24 * 365 ≈ 50,000 data points / year / sensor
> 
> In terms of size, that is 50,000 x 100 bytes = 5 MB worth of data / year / sensor
> 
> Now the big problem is that we have 30 million sensors, so the disk
> requirement adds up pretty fast: 5 MB * 30,000,000 = 150 TB
> worth of data / year
> 
> We want to store data for 2 years => 300 TB
> 
> We have RF=3 ==> 900 TB!
> 
> Now, according to the commonly recommended density (with SSDs), one
> should not exceed 2 TB of data per node, which gives us a rough sizing
> of a 450-node cluster!
> 
> Even if we push the limit up to 10 TB per node using TWCS (has anyone
> tried this?), we would still need 90 beefy nodes to support this.
> 
> Any thoughts/ideas on reducing the node count or increasing density
> while keeping the cluster manageable?
> 
> Regards
> 
> Duy Hai DOAN
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: user-help@cassandra.apache.org
> 
