From: Mike Heffner
Date: Sun, 19 Jun 2016 12:17:57 -0400
Subject: Optimizing IOPS during sequential I/O (compactions)
To: dev@cassandra.apache.org

Hi,

I'm curious whether anyone has attempted to improve read IOPS during sequential I/O operations (largely compactions) while still maintaining read performance for small-row, random-access client reads.

Our use case has a very high write-to-read load ratio, with small rows (< 1-2 KB). We've taken many of the usual steps to keep random-access client reads optimal, such as reducing read_ahead and even using a smaller default LZ4 chunk size. So far performance has been great, with p95 read times under 10 ms.

However, we've noticed that total read IOPS to the Cassandra data drive is extremely high compared to write IOPS, almost 15x the write IOPS to the same drive. We even set up a ring that took the same write load with zero client reads and confirmed that the high read IOPS were driven by compaction operations. During large (>50 GB) compactions, read and write volume (bytes) were nearly identical, which matched our assumptions, while read IOPS were 15x write IOPS.

When we plotted the average read and write op sizes, we saw an average read op size of just under 5 KB and an average write op size of 120 KB. Given that we are using the default disk access mode of mmap, this aligns with our assumption that we are paging in a single 4 KB page at a time, while write flushes are being coalesced. To test this, we switched a single node to `disk_access_mode: standard`, which should issue reads based on the chunk sizes, and found that the read op size increased to ~7.5 KB: https://imgur.com/okbfFby

We don't want to sacrifice read performance, but we also must scale/size our disks for peak IOPS. If we could cut the read IOPS by a quarter or even half during compaction operations, that would mean a large cost savings. We are also limited on drive throughput, so there is a practical maximum op size we'd want to use to stay under that throughput limit. Alternatively, we could tune compaction throughput to maintain that limit as well.

Has any work been done to optimize sequential I/O operations in Cassandra? Naively, it seems that sequential I/O could use a standard-disk-access-mode reader with a configurable block size, while normal read operations stick to the mmap'd segments. Being unfamiliar with the code, are compaction/sequential sstable reads done through any single interface, or do they use the same path as normal read ops?

Thoughts?

-Mike

--
Mike Heffner
Librato, Inc.
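
A minimal sketch of how the average read/write op sizes discussed above can be estimated, assuming a Linux host where the data drive's counters appear in /proc/diskstats; the device name "xvdb", the sampling interval, and the Python script itself are illustrative assumptions, not part of the original measurement setup:

#!/usr/bin/env python
# Estimate average read/write op sizes for a block device by sampling
# /proc/diskstats twice and dividing the byte delta by the op delta.
import time

DEVICE = "xvdb"          # hypothetical Cassandra data drive; substitute yours
SECTOR_BYTES = 512       # /proc/diskstats counts 512-byte sectors
INTERVAL_SECS = 10

def read_diskstats(device):
    """Return (reads_completed, sectors_read, writes_completed, sectors_written)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]), int(fields[5]), int(fields[7]), int(fields[9])
    raise ValueError("device %s not found in /proc/diskstats" % device)

before = read_diskstats(DEVICE)
time.sleep(INTERVAL_SECS)
after = read_diskstats(DEVICE)

reads, rsect, writes, wsect = (a - b for a, b in zip(after, before))
if reads:
    print("avg read op:  %.1f KB over %d ops" % (rsect * SECTOR_BYTES / 1024.0 / reads, reads))
if writes:
    print("avg write op: %.1f KB over %d ops" % (wsect * SECTOR_BYTES / 1024.0 / writes, writes))

Each /proc/diskstats sector is 512 bytes, so sectors_delta * 512 / ops_delta over the sampling window gives the average op size; run during a large compaction, numbers on the order of ~5 KB reads versus ~120 KB writes would reproduce the 15x IOPS gap described above.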