From: Mike Heffner
Date: Sun, 19 Jun 2016 12:17:57 -0400
Subject: Optimizing IOPS during sequential I/O (compactions)
To: dev@cassandra.apache.org

Hi,

I'm curious whether anyone has attempted to improve read IOPS during sequential I/O operations (largely compactions) while still maintaining read performance for small-row, random-access client reads.

Our use case has a very high write-to-read load ratio, with small rows (< 1-2 KB). We've taken many of the usual steps to keep random-access client reads optimal, such as reducing read_ahead and even using a smaller default LZ4 chunk size. So far performance has been great, with p95 read times under 10 ms.

However, we've noticed that total read IOPS to the Cassandra data drive is extremely high compared to write IOPS, almost 15x the write IOPS to the same drive. We even set up a ring that took the same write load with zero client reads and confirmed that the high read IOPS were driven by compaction operations. During large (>50 GB) compactions, read and write volume (bytes) were nearly identical, which matched our assumptions, while read IOPS were 15x write IOPS.

When we plotted the average read and write op sizes, we saw an average read op size of just under 5 KB and an average write op size of 120 KB. Given that we are using the default disk access mode of mmap, this aligns with our assumption that we are paging in a single 4 KB page at a time, while write flushes are being coalesced. To test this, we switched a single node to `disk_access_mode: standard`, which should issue reads based on the chunk sizes, and found that the read op size increased to ~7.5 KB: https://imgur.com/okbfFby

We don't want to sacrifice read performance, but we also must scale/size our disks for peak IOPS. If we could cut the read IOPS by a quarter or even half during compaction operations, that would mean a large cost savings. We are also limited on drive throughput, so there is a practical maximum op size we'd want to use to stay under that throughput limit. Alternatively, we could tune compaction throughput to maintain that limit as well.

Has any work been done to optimize sequential I/O operations in Cassandra? Naively, it seems that sequential I/O could use a standard-disk-access-mode reader with a configurable block size, while normal read operations stick to the mmap'd segments. Being unfamiliar with the code, are compaction/sequential sstable reads done through any single interface, or do they use the same path as normal read ops?

Thoughts?

-Mike

--
Mike Heffner
Librato, Inc.
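
A minimal sketch of how the average read/write op sizes discussed above can be estimated, assuming a Linux host where the data drive's counters appear in /proc/diskstats; the device name "xvdb", the sampling interval, and the Python script itself are illustrative assumptions, not part of the original measurement setup:

#!/usr/bin/env python
# Estimate average read/write op sizes for a block device by sampling
# /proc/diskstats twice and dividing the byte delta by the op delta.
import time

DEVICE = "xvdb"          # hypothetical Cassandra data drive; substitute yours
SECTOR_BYTES = 512       # /proc/diskstats counts 512-byte sectors
INTERVAL_SECS = 10

def read_diskstats(device):
    """Return (reads_completed, sectors_read, writes_completed, sectors_written)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3]), int(fields[5]), int(fields[7]), int(fields[9])
    raise ValueError("device %s not found in /proc/diskstats" % device)

before = read_diskstats(DEVICE)
time.sleep(INTERVAL_SECS)
after = read_diskstats(DEVICE)

reads, rsect, writes, wsect = (a - b for a, b in zip(after, before))
if reads:
    print("avg read op:  %.1f KB over %d ops" % (rsect * SECTOR_BYTES / 1024.0 / reads, reads))
if writes:
    print("avg write op: %.1f KB over %d ops" % (wsect * SECTOR_BYTES / 1024.0 / writes, writes))

Each /proc/diskstats sector is 512 bytes, so sectors_delta * 512 / ops_delta over the sampling window gives the average op size; run during a large compaction, numbers on the order of ~5 KB reads versus ~120 KB writes would reproduce the 15x IOPS gap described above.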