cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters
Date Wed, 27 May 2015 12:56:18 GMT

     [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-9491:
--------------------------------------
    Assignee: Yuki Morishita

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Yuki Morishita
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool repair -pr}}), the statistics show:
> * huge increase in live-sstable-count (approximately doubling it),
> * huge number of memtable-switches (approx 1200 per node per minute),
> * huge number of flushes (approx 25 per node per minute),
> * memtable-data-size drops to (nearly) 0,
> * huge number of compaction-completed-tasks (60k per minute) and compacted-bytes (25 GB per minute).
> These numbers do not match the tiny workload that the cluster actually handles.
> The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (invoked via {{nodetool repair -pr}}), a repair on the ranges from {{getLocalPrimaryRanges(keyspace)}} is initiated. I'll express the flow in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
> 	foreach columnFamily
> 	{
> 		start async RepairJob
> 		{
> 			if sequentialRepair:
> 				start SnapshotTask against each endpoint (including self)
> 				send tree requests if snapshot successful
> 			else // if parallel repair
> 				send tree requests
> 		}
> 	}
> }
> {code}
> This means that for each sequential repair, a snapshot (with all its implications: flushes, tiny sstables, follow-up compactions) is taken for every range. With 256 vnodes, that means 256 snapshots per column family per repair on each involved endpoint. For about 20 tables, this amounts to 5120 snapshots within a very short period of time. You do not notice that volume on the file system, since the _tag_ for the snapshot is always the same, so all snapshots end up in the same directory.
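> To make the arithmetic concrete, here is a minimal, self-contained Java sketch (illustrative only, not Cassandra code; all names are made up) that simulates the current per-range snapshot behaviour:
> {code}
> // Toy simulation of the current sequential-repair behaviour: one snapshot
> // per (range, columnFamily) pair. Not Cassandra code; names are invented.
> public class SnapshotCountDemo
> {
>     public static void main(String[] args)
>     {
>         int vnodeRanges = 256;    // default num_tokens
>         int columnFamilies = 20;  // tables in the keyspace
>
>         int snapshots = 0;
>         for (int range = 0; range < vnodeRanges; range++)
>             for (int cf = 0; cf < columnFamilies; cf++)
>                 snapshots++; // one SnapshotTask per (range, table) pair
>
>         // prints 5120 - all sharing one tag, so they land in the same
>         // snapshots/<tag>/ directory and are easy to miss on disk
>         System.out.println("snapshots per endpoint: " + snapshots);
>     }
> }
> {code}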
> IMO it would be sufficient to snapshot only once per column family. Or am I missing something?
> So basically, the pseudo-code would change to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> if sequentialRepair:
> {
> 	foreach columnFamily
> 	{
> 		start SnapshotTask against each endpoint (including self)
> 	}
> }
> foreach range in ranges:
> {
> 	foreach columnFamily
> 	{
> 		start async RepairJob
> 		{
> 			send tree requests (if snapshot successful)
> 		}
> 	}
> }
> {code}
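> An alternative to restructuring the loops would be a small once-per-table guard. The following is a minimal sketch (hypothetical names, not a patch against the actual repair code) of how the snapshot could be deduplicated per column family:
> {code}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Sketch of a once-per-column-family snapshot guard. Hypothetical names;
> // not the actual Cassandra implementation.
> public class SnapshotOnceGuard
> {
>     // column families already snapshotted during this repair session
>     private final Set<String> snapshotted = ConcurrentHashMap.newKeySet();
>
>     // returns true only on the first call per column family, so the caller
>     // issues a single SnapshotTask per table instead of one per range
>     public boolean shouldSnapshot(String columnFamily)
>     {
>         return snapshotted.add(columnFamily);
>     }
> }
> {code}
> With 256 vnode ranges and 20 tables, either approach drops the snapshot count per endpoint from 5120 to 20.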
> NB: The code's similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
