From Michael Shuler <>
Subject Re: Really need some advices on large data considerations
Date Thu, 15 May 2014 02:54:09 GMT
On 05/13/2014 08:13 PM, Yatong Zhang wrote:
> Thank you Aaron, but we're planning about 20T per node, is that feasible?

20T per node is 4x the max recommended data per node, which is 5T/node 
on high-end spec hardware: nodes with 16+ cores, 128-256G RAM, SSD, and 
10gigE.
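
To put rough numbers on that, here is a minimal sketch of the arithmetic. The 20T and 5T figures come from this thread; the 10-node cluster in the usage note is purely a made-up example, and the function names are mine, not anything from Cassandra.

```python
import math

def overload_factor(planned_tb_per_node, recommended_tb_per_node):
    """How many times over the recommended per-node density a plan is."""
    return planned_tb_per_node / recommended_tb_per_node

def nodes_needed(total_tb, recommended_tb_per_node):
    """Smallest node count that keeps every node at or under the recommendation.

    total_tb is the total on-disk data across the cluster, replicas included.
    """
    return math.ceil(total_tb / recommended_tb_per_node)

# Figures from this thread: 20T planned vs 5T recommended per node.
factor = overload_factor(20, 5)

# Hypothetical cluster: 10 nodes at 20T each is 200T on disk in total.
nodes = nodes_needed(10 * 20, 5)
```

With those inputs, `factor` is 4.0 and `nodes` is 40, so the same data spread at the recommended density needs four times as many nodes.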

pgs 12-13 (the whole doc is well worth a careful read):

In looking at your other disk space thread, I see that you are using 4T 
drives, so those definitely aren't SSD. It also looks like you 
partitioned /dev/sda for an OS partition and are using the rest for data 
- I assume your commitlog is on /dev/sda1, so your /dev/sda3 data 
partition is on the same spindle as your commitlog - not recommended.
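
As a quick sanity check for that kind of layout, something like the sketch below can tell you whether two partitions sit on the same physical disk. It is only a heuristic based on common Linux device naming (sdXN, nvmeXnYpZ, mmcblkXpY), it does not look at the real hardware, and the function names are hypothetical.

```python
import re

def parent_disk(partition):
    """Guess the whole-disk device for a partition from its name alone."""
    # NVMe / MMC style: /dev/nvme0n1p2 -> /dev/nvme0n1
    m = re.fullmatch(r"(/dev/(?:nvme\d+n\d+|mmcblk\d+))p\d+", partition)
    if m:
        return m.group(1)
    # SCSI/SATA/virtio style: /dev/sda3 -> /dev/sda
    m = re.fullmatch(r"(/dev/[a-z]+)\d+", partition)
    if m:
        return m.group(1)
    # Already a whole disk, or a naming scheme this sketch does not know.
    return partition

def share_spindle(device_a, device_b):
    """True if both devices appear to live on the same physical disk."""
    return parent_disk(device_a) == parent_disk(device_b)

# The layout assumed in this thread: commitlog on sda1, data on sda3.
same_disk = share_spindle("/dev/sda1", "/dev/sda3")
```

Here `same_disk` comes out True, which is the situation to avoid: commitlog writes and data reads/compaction end up competing for the same disk head.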

I would RAID0 all those data drives, personally, and give up managing 
them separately. They are on multiple PCIe controllers, one drive per 
channel, right?

You are already running out of disk space, and you then opted for LCS, 
which is about 2x more I/O intensive than size-tiered compaction; this 
could add a different level of pain on spindles.

I would highly suggest rethinking how you want to set up your data 
model and re-planning your cluster appropriately, to be honest. I'm not 
saying that working with what you have isn't at all possible, but you 
are experiencing pain due to pushing far beyond the bounds of the 
suggested recommendations for Cassandra. If you have a high threshold 
for pain, then carry on. I mean this with admiration - I would *love* 
to hear back that you have sorted out all your issues, and do continue 
to post questions for help. I could be completely off-base in my reading.

I do think many more nodes is the way to go with this much data - this 
is Cassandra's strength. I'm not sure what your data actually contains, 
but if you are using large blobs like image data, think about putting 
that blob data somewhere else, and storing only the metadata in 
Cassandra with, e.g., URL pointers to where the image data can be 
retrieved - stuff like that will help.
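
A minimal sketch of that pointer pattern, to make the idea concrete. Plain dicts stand in for both the external blob store and the Cassandra table, and the base URL, column names, and function names are all hypothetical; a real setup would use an object store or file server and a Cassandra driver instead.

```python
import hashlib

BLOB_BASE_URL = "https://blobs.example.invalid/"  # hypothetical location

blob_store = {}      # stand-in for an external object/file store
metadata_table = {}  # stand-in for a Cassandra table keyed by image_id

def save_image(image_id, data, content_type="image/jpeg"):
    """Store the bytes externally and keep only a small metadata row."""
    key = hashlib.sha256(data).hexdigest()  # content-addressed blob key
    blob_store[key] = data
    row = {
        "url": BLOB_BASE_URL + key,   # pointer to the blob, not the blob
        "size_bytes": len(data),
        "content_type": content_type,
    }
    metadata_table[image_id] = row
    return row

def load_image(image_id):
    """Look up the pointer in the metadata row, then fetch the blob."""
    row = metadata_table[image_id]
    key = row["url"][len(BLOB_BASE_URL):]
    return blob_store[key]

row = save_image("img-1", b"\x89PNG...fake image bytes")
data = load_image("img-1")
```

The row that would live in Cassandra stays a few hundred bytes no matter how large the image is, which is what keeps per-node data size under control.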

Warm regards,
Michael Shuler

