Date: Wed, 27 May 2015 12:56:18 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters

    [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-9491:
--------------------------------------
    Assignee: Yuki Morishita

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Yuki Morishita
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool -pr}}), statistics show:
> * huge increase in live-sstable-count (approximately doubling),
> * huge number of memtable-switches (approx. 1200 per node per minute),
> * huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size dropping to (nearly) 0,
> * huge number of compaction-completed-tasks (60k per minute) and compacted-bytes (25 GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is how sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (invoked by {{nodetool -pr}}), a repair is initiated on the ranges from {{getLocalPrimaryRanges(keyspace)}}. I'll express the flow in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>     foreach columnFamily
>     {
>         start async RepairJob
>         {
>             if sequentialRepair:
>                 start SnapshotTask against each endpoint (including self)
>                 send tree requests if snapshot successful
>             else // parallel repair
>                 send tree requests
>         }
>     }
> }
> {code}
> This means that for each sequential repair, a snapshot (with all its implications: flushes, tiny sstables, follow-up compactions) is taken for every range. That is 256 snapshots per column family per repair on each involved endpoint. For about 20 tables, this could mean 5120 snapshots within a very short period of time.
> You do not see that number on the file system, since the _tag_ for the snapshot is always the same, so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column family. Or am I missing something?
> So basically changing the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>     foreach columnFamily
>     {
>         if sequentialRepair:
>             start SnapshotTask against each endpoint (including self)
>         start async RepairJob
>         {
>             send tree requests (if snapshot successful)
>         }
>     }
> }
> {code}
> NB: The code is similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
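
A quick back-of-the-envelope check of the snapshot counts cited in the description, as a small self-contained Java sketch. This is not Cassandra code; the range and table counts (256 and 20) are simply the figures quoted above, and the class and variable names are made up for illustration.

{code}
// Toy comparison (not Cassandra code) of snapshots taken per endpoint per
// sequential repair: once per range per column family (current behaviour)
// versus once per column family (proposed). The counts 256 and 20 come
// straight from the ticket description.
public class SnapshotCountSketch {
    public static void main(String[] args) {
        int vnodeRanges = 256;   // primary ranges per node with vnodes, as cited above
        int columnFamilies = 20; // "about 20 tables"

        int currentSnapshots = vnodeRanges * columnFamilies; // 5120, matching the description
        int proposedSnapshots = columnFamilies;              // 20

        System.out.println("current scheme:  " + currentSnapshots + " snapshots per endpoint per repair");
        System.out.println("proposed scheme: " + proposedSnapshots + " snapshots per endpoint per repair");
    }
}
{code}

The gap between the two figures (5120 vs. 20 snapshots per endpoint) is what drives the flush, memtable-switch, and compaction spikes listed in the description.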