> 700GB/512MB*5(files per SST) = 7000 files, that is OK from my view.
> 700GB/5MB*5 = 700000 files, that is too much for single directory, too much
> memory used for SST data, too huge compaction queue (that leads to strange
> pauses, I suppose because of compactor thinking what to compact next),...
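
To make the arithmetic in the quoted estimate easy to re-run with other sizes, here is a rough sketch. It assumes binary units and 5 on-disk files per SSTable (the figure used in the estimate); the function name is only illustrative.

```python
# Rough estimate only. Assumptions: binary units, and 5 on-disk files
# per SSTable as in the estimate quoted above.
GB = 1024 ** 3
MB = 1024 ** 2

def estimated_file_count(data_bytes, sstable_bytes, files_per_sstable=5):
    """Number of files in the data directory for a given SSTable size."""
    sstables = -(-data_bytes // sstable_bytes)  # ceiling division
    return sstables * files_per_sstable

print(estimated_file_count(700 * GB, 512 * MB))  # 7000
print(estimated_file_count(700 * GB, 5 * MB))    # 716800, i.e. roughly 700000
```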

Not sure why a lot of files is a problem... modern filesystems deal
with that pretty well.

Maybe. Or maybe the problem is not the filesystem but Cassandra itself. I've seen compaction slow down when the compaction queue gets too large, and it can get too large when you have a lot of SSTables. Note that each SSTable costs both FS metadata (and the FS metadata cache can be limited) and Cassandra in-memory data.
Anyway, as I see it, a performance test in this area would be great. Otherwise it's all speculation.

Really large sstables mean that compactions now are taking a lot more
disk IO and time to complete.

In my view, this point is valid only when your flushes are small. Otherwise you still need to compact the whole key range the flush covers, no matter whether it is one large file or multiple small ones. One large file can even be cheaper to compact.

Remember, Leveled Compaction is more
disk IO intensive, so using large sstables makes that even worse.
This is a big reason why the default is 5MB. Also, each level is 10x
the size of the previous level. Also, for level compaction, you need
10x the sstable size worth of free space to do compactions. So now
you need 5GB of free disk, vs 50MB of free disk.
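
For reference, the free-space figures above follow directly from the 10x rule. A minimal sketch, assuming the headroom is simply 10 times the sstable size as stated (the function name is hypothetical, not a Cassandra API):

```python
# Sketch of the rule of thumb above: free space needed for a leveled
# compaction is taken to be fanout * sstable size (fanout = 10).
MB = 1024 ** 2

def compaction_headroom(sstable_bytes, fanout=10):
    """Free disk space needed, per the 10x rule quoted above."""
    return fanout * sstable_bytes

for size_mb in (5, 512):
    headroom_mb = compaction_headroom(size_mb * MB) // MB
    print(f"{size_mb}MB sstables -> about {headroom_mb}MB of free disk")
# 5MB sstables -> about 50MB of free disk
# 512MB sstables -> about 5120MB of free disk
```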

I really don't think 5GB of free space is too much :)

Also, if you're doing deletes in those CF's, that old, deleted data is
going to stick around a LOT longer with 512MB files, because it can't
get deleted until you have 10x512MB files to compact to level 2.
Heaven forbid it doesn't get deleted then because each level is 10x
bigger so you end up waiting a LOT longer to actually delete that data
from disk.

But if I have small SSTables, all my data goes to high levels (the 4th for me when I had the 128M setting). It also takes time for updates to reach that level. I am not sure which way is faster.
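
A rough way to see how deep the data ends up for a given SSTable size. This is only a sketch: it assumes level N holds 10^N SSTables (each level 10x the previous one) and about 700GB of data, as in the estimate at the top of this mail. It is not how Cassandra itself computes anything.

```python
# Sketch under stated assumptions: level N holds fanout**N sstables,
# and levels fill up from level 1 until all the data fits.
MB = 1024 ** 2
GB = 1024 ** 3

def deepest_level(data_bytes, sstable_bytes, fanout=10):
    """Smallest level count whose combined capacity holds the data."""
    level, capacity_total = 0, 0
    while capacity_total < data_bytes:
        level += 1
        capacity_total += (fanout ** level) * sstable_bytes
    return level

for size_mb in (5, 128, 512):
    print(f"{size_mb}MB sstables -> level {deepest_level(700 * GB, size_mb * MB)}")
# 5MB sstables -> level 6
# 128MB sstables -> level 4
# 512MB sstables -> level 4
```

Under these assumptions 128MB and 512MB both land on level 4 for 700GB, while 5MB needs 6 levels.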

Now, if you're using SSDs then larger sstables are probably doable,
but even then I'd guesstimate 50MB is far more reasonable than 512MB.

I don't think SSDs are great for writes/compaction. Cassandra does this in a streaming fashion, and regular HDDs are faster than SSDs for linear read/write. SSDs are good for random access, which for Cassandra means reads.

P.S. I still think my way is better, yet it would be great to perform some real tests.


Aaron Turner Twitter: @synfinatic
Best regards,
Vitalii Tymchyshyn