Date: Wed, 27 May 2015 12:56:18 +0000 (UTC)
From: "Jonathan Ellis (JIRA)"
To: commits@cassandra.apache.org
Reply-To: dev@cassandra.apache.org
Subject: [jira] [Updated] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters

    [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-9491:
--------------------------------------
    Assignee: Yuki Morishita

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Yuki Morishita
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool -pr}}), statistics show:
> * huge increase in live-sstable-count (approximately doubling),
> * huge number of memtable-switches (approx. 1200 per node per minute),
> * huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size dropping to (nearly) 0,
> * huge number of compaction-completed-tasks (60k per minute) and compacted-bytes (25 GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is how sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (invoked by {{nodetool -pr}}), a repair is initiated on the ranges from {{getLocalPrimaryRanges(keyspace)}}. I'll express the flow in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>     foreach columnFamily
>     {
>         start async RepairJob
>         {
>             if sequentialRepair:
>                 start SnapshotTask against each endpoint (including self)
>                 send tree requests if snapshot successful
>             else // parallel repair
>                 send tree requests
>         }
>     }
> }
> {code}
> This means that for each sequential repair, a snapshot (with all its implications: flushes, tiny sstables, follow-up compactions) is taken for every range. That is 256 snapshots per column family per repair on each involved endpoint. For about 20 tables, this could mean 5120 snapshots within a very short period of time.
> You do not see that number on the file system, since the _tag_ for the snapshot is always the same, so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column family. Or am I missing something?
> So basically changing the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
>     foreach columnFamily
>     {
>         if sequentialRepair:
>             start SnapshotTask against each endpoint (including self)
>         start async RepairJob
>         {
>             send tree requests (if snapshot successful)
>         }
>     }
> }
> {code}
> NB: The code is similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
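
A quick back-of-the-envelope check of the snapshot counts cited in the description, as a small self-contained Java sketch. This is not Cassandra code; the range and table counts (256 and 20) are simply the figures quoted above, and the class and variable names are made up for illustration.

{code}
// Toy comparison (not Cassandra code) of snapshots taken per endpoint per
// sequential repair: once per range per column family (current behaviour)
// versus once per column family (proposed). The counts 256 and 20 come
// straight from the ticket description.
public class SnapshotCountSketch {
    public static void main(String[] args) {
        int vnodeRanges = 256;   // primary ranges per node with vnodes, as cited above
        int columnFamilies = 20; // "about 20 tables"

        int currentSnapshots = vnodeRanges * columnFamilies; // 5120, matching the description
        int proposedSnapshots = columnFamilies;              // 20

        System.out.println("current scheme:  " + currentSnapshots + " snapshots per endpoint per repair");
        System.out.println("proposed scheme: " + proposedSnapshots + " snapshots per endpoint per repair");
    }
}
{code}

The gap between the two figures (5120 vs. 20 snapshots per endpoint) is what drives the flush, memtable-switch, and compaction spikes listed in the description.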