cassandra-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "J. Ryan Earl" <...@jryanearl.us>
Subject Re: Unbalanced disk load
Date Sat, 18 Jul 2015 17:09:27 GMT
Even with https://issues.apache.org/jira/browse/CASSANDRA-7386 data
balancing across JBOD setups is pretty horrible.  Having used JBOD for
about 2 years from 1.2.x and up, it is my opinion JBOD on Cassandra is
nascent at best and far from mature.  For a variety of reasons, JBOD should
perform better if IO and data is balanced across multiple devices due to
things like linux device queues, striping overhead, access contention, and
so forth.  However, data and access patterns simply are not balanced in
Cassandra JBOD setups.

Here's an example of what we see on one of our nodes:

/dev/sdd1             1.1T  202G  915G  19% /data/2
/dev/sde1             1.1T  136G  982G  13% /data/3
/dev/sdi1             1.1T  217G  901G  20% /data/7
/dev/sdc1             1.1T  402G  715G  36% /data/1
/dev/sdh1             1.1T  187G  931G  17% /data/6
/dev/sdf1             1.1T  201G  917G  18% /data/4
/dev/sdg1             1.1T  154G  963G  14% /data/5

Essentially, for a storage engine to make good use of JBOD, like HDFS or
Ceph does, the storage engines need to essentially be designed from the
ground up to use JBOD.  In Cassandra, a single sstable cannot be split at
the storage engine level to be split across members of the JBOD.  In our
case, we have a single sstable file that is bigger than all the data files
combined on other disks.  Looking at the disk using 402G, we see:

274G vcells_polished_data-vcells_polished-jb-38985-Data.db

A single sstable is using 274G.  In addition to data usage imbalance, we
see hot spots as well.  With static fields in particular, and CFs that
don't change much, you'll get CFs that end up compacting into fewer number
of large sstables.  With most of the data for a CF being in one sstable and
on one data volume, a single data volume then becomes a hotspot for reads
on that CF.  Cassandra tries to minimize the number of sstables a row will
be written across, but in particular after some compaction on an CFs that
are rarely updated, most of the data for a CF can end up in a single
sstable, and stables aren't split aross data volumes.  Thus a single volume
will be a hot-spot for access to that CF in a JBOD setup as Cassandra does
not effectively distribute data across individual volumes in all
circumstances.

There may be tuning which would help this, but it's specific to JBOD and
not somewhat that you would have to worry about in a single data volume
setup, ie RAID0.  With a RAID0, the downside of course, is that losing a
single member disk to the RAID0 takes the node down.  The upside is you
don't have to worry about the imbalance of both I/O and data footprint
across individual volumes.

Unlike HDFS, Ceph, and RAID for that matter, where you're dealing with
maximum fixed sizes blocks/stripes that are then distributed at a granular
level across the JBOD volumes, Cassandra is dealing with uncapped, low
granularity, variable sized sstable data files which it attempts to
distribute across JBOD volumes making JBOD far from ideal.  Frankly, it's
hard for me to imagine any columnar data store doing JBOD well.

On Fri, Jul 17, 2015 at 4:08 PM, Soerian Lieve <slieve@liveramp.com> wrote:

> Hi,
>
> I am currently benchmarking Cassandra with three machines, and on each
> machine I am seeing an unbalanced distribution of data among the data
> directories (1 per disk).
> I am concerned that this affects my write performance, is there anything
> that I can make the distribution be more even? Would raid0 be my best
> option?
>
> Details:
> 3 machines, each have 24 cores, 64GB of RAM, 7 SSDs of 500GB each.
> Commitlog is on a separate disk, cassandra.yaml configured according to
> Datastax' guide on cassandra.yaml.
> Total size of data is about 2TB, 14B records, all unique. Replication
> factor of 1.
>
> Thanks,
> Soerian
>

Mime
View raw message