Even with https://issues.apache.org/jira/browse/CASSANDRA-7386 data balancing across JBOD setups is pretty horrible.  Having used JBOD for about 2 years from 1.2.x and up, it is my opinion JBOD on Cassandra is nascent at best and far from mature.  For a variety of reasons, JBOD should perform better if IO and data is balanced across multiple devices due to things like linux device queues, striping overhead, access contention, and so forth.  However, data and access patterns simply are not balanced in Cassandra JBOD setups.

Here's an example of what we see on one of our nodes:

/dev/sdd1             1.1T  202G  915G  19% /data/2
/dev/sde1             1.1T  136G  982G  13% /data/3
/dev/sdi1             1.1T  217G  901G  20% /data/7
/dev/sdc1             1.1T  402G  715G  36% /data/1
/dev/sdh1             1.1T  187G  931G  17% /data/6
/dev/sdf1             1.1T  201G  917G  18% /data/4
/dev/sdg1             1.1T  154G  963G  14% /data/5

Essentially, for a storage engine to make good use of JBOD, like HDFS or Ceph does, the storage engines need to essentially be designed from the ground up to use JBOD.  In Cassandra, a single sstable cannot be split at the storage engine level to be split across members of the JBOD.  In our case, we have a single sstable file that is bigger than all the data files combined on other disks.  Looking at the disk using 402G, we see:

274G vcells_polished_data-vcells_polished-jb-38985-Data.db

A single sstable is using 274G.  In addition to data usage imbalance, we see hot spots as well.  With static fields in particular, and CFs that don't change much, you'll get CFs that end up compacting into fewer number of large sstables.  With most of the data for a CF being in one sstable and on one data volume, a single data volume then becomes a hotspot for reads on that CF.  Cassandra tries to minimize the number of sstables a row will be written across, but in particular after some compaction on an CFs that are rarely updated, most of the data for a CF can end up in a single sstable, and stables aren't split aross data volumes.  Thus a single volume will be a hot-spot for access to that CF in a JBOD setup as Cassandra does not effectively distribute data across individual volumes in all circumstances.

There may be tuning which would help this, but it's specific to JBOD and not somewhat that you would have to worry about in a single data volume setup, ie RAID0.  With a RAID0, the downside of course, is that losing a single member disk to the RAID0 takes the node down.  The upside is you don't have to worry about the imbalance of both I/O and data footprint across individual volumes.

Unlike HDFS, Ceph, and RAID for that matter, where you're dealing with maximum fixed sizes blocks/stripes that are then distributed at a granular level across the JBOD volumes, Cassandra is dealing with uncapped, low granularity, variable sized sstable data files which it attempts to distribute across JBOD volumes making JBOD far from ideal.  Frankly, it's hard for me to imagine any columnar data store doing JBOD well.


On Fri, Jul 17, 2015 at 4:08 PM, Soerian Lieve <slieve@liveramp.com> wrote:
Hi,

I am currently benchmarking Cassandra with three machines, and on each machine I am seeing an unbalanced distribution of data among the data directories (1 per disk). 
I am concerned that this affects my write performance, is there anything that I can make the distribution be more even? Would raid0 be my best option?

Details:
3 machines, each have 24 cores, 64GB of RAM, 7 SSDs of 500GB each. 
Commitlog is on a separate disk, cassandra.yaml configured according to Datastax' guide on cassandra.yaml.
Total size of data is about 2TB, 14B records, all unique. Replication factor of 1.

Thanks,
Soerian