cassandra-commits mailing list archives

From "Yuki Morishita (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-9491) Inefficient sequential repairs against vnode clusters
Date Fri, 29 May 2015 14:11:17 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-9491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564863#comment-14564863 ]

Yuki Morishita commented on CASSANDRA-9491:
-------------------------------------------

A snapshot is taken for creating the Merkle tree one replica node at a time.
If we took the snapshot just once at the beginning, then by the time the 256th range is calculating its Merkle tree from it, we could be repairing state from many hours ago.

So it may be better to set some window for reusing a snapshot, instead of taking the snapshot just once.
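
As a rough illustration of that idea, the sketch below reuses a per-table snapshot only while it is younger than a configurable window; the names here ({{WindowedSnapshotTracker}}, {{ensureSnapshot}}, {{takeSnapshot}}) are made up for the example and are not Cassandra APIs.

{code}
// Illustrative sketch only; not Cassandra code.
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class WindowedSnapshotTracker
{
    private final Duration window;
    // column family name -> time its current snapshot was taken
    private final Map<String, Instant> snapshotTimes = new ConcurrentHashMap<>();

    public WindowedSnapshotTracker(Duration window)
    {
        this.window = window;
    }

    // Reuse the existing snapshot for this column family if it is younger
    // than the window; otherwise take a fresh one.
    public void ensureSnapshot(String columnFamily)
    {
        Instant now = Instant.now();
        snapshotTimes.compute(columnFamily, (cf, takenAt) ->
        {
            if (takenAt != null && Duration.between(takenAt, now).compareTo(window) < 0)
                return takenAt;        // still within the window: reuse it
            takeSnapshot(cf);          // missing or too old: snapshot again
            return now;
        });
    }

    // Stand-in for the real per-endpoint snapshot request.
    private void takeSnapshot(String columnFamily)
    {
        System.out.println("taking snapshot of " + columnFamily);
    }

    public static void main(String[] args)
    {
        WindowedSnapshotTracker tracker = new WindowedSnapshotTracker(Duration.ofMinutes(30));
        // 256 per-range validations of the same table reuse one snapshot
        // instead of taking 256 of them.
        for (int range = 0; range < 256; range++)
            tracker.ensureSnapshot("my_table");
    }
}
{code}

With a window like 30 minutes, the 256 per-range validations of a table would reuse a handful of snapshots rather than triggering 256 flush/snapshot cycles, while still bounding how stale the repaired state can get.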

> Inefficient sequential repairs against vnode clusters
> -----------------------------------------------------
>
>                 Key: CASSANDRA-9491
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9491
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Robert Stupp
>            Assignee: Yuki Morishita
>            Priority: Minor
>
> I've got a cluster with vnodes enabled. People regularly run sequential repairs against that cluster.
> During such a sequential repair (just {{nodetool -pr}}), statistics show:
> * a huge increase in live-sstable-count (approx. doubling the amount),
> * a huge number of memtable-switches (approx. 1200 per node per minute),
> * a huge number of flushes (approx. 25 per node per minute),
> * memtable-data-size drops to (nearly) 0,
> * a huge amount of compaction-completed-tasks (60k per minute) and compacted-bytes (25 GB per minute).
> These numbers do not match the tiny workload the cluster actually has.
> The reason for these (IMO crazy) numbers is the way sequential repairs work on vnode clusters:
> Starting at {{StorageService.forceRepairAsync}} (from {{nodetool -pr}}), a repair on the ranges from {{getLocalPrimaryRanges(keyspace)}} is initiated. I'll express the scheme in pseudo-code:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach range in ranges:
> {
> 	foreach columnFamily
> 	{
> 		start async RepairJob
> 		{
> 			if sequentialRepair:
> 				start SnapshotTask against each endpoint (including self)
> 				send tree requests if snapshot successful
> 			else // if parallel repair
> 				send tree requests
> 		}
> 	}
> }
> {code}
> This means that, for each sequential repair, a snapshot (including all its implications like flushes, tiny sstables, and follow-up compactions) is taken for every range. That means 256 snapshots per column family per repair on each (involved) endpoint. For about 20 tables, this could mean 5120 snapshots within a very short period of time. You do not notice that amount on the file system, since the _tag_ for the snapshot is always the same, so all snapshots end up in the same directory.
> IMO it would be sufficient to snapshot only once per column family. Or am I missing something?
> So basically changing the pseudo-code to:
> {code}
> ranges = getLocalPrimaryRanges(keyspace)
> foreach columnFamily
> {
> 	if sequentialRepair:
> 		start SnapshotTask against each endpoint (including self)
> }
> foreach range in ranges:
> {
> 	start async RepairJob
> 	{
> 		send tree requests (if snapshot successful)
> 	}
> }
> {code}
> NB: The code's similar in all versions (checked 2.0.11, 2.0.15, 2.1, 2.2, trunk)
> EDIT: corrected target pseudo-code
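
For reference, here is a rough Java sketch of the reordering proposed in the quoted pseudo-code, with made-up types standing in for Cassandra's actual repair machinery (these are not the real {{RepairJob}}/{{SnapshotTask}} classes):

{code}
// Illustrative types only; not Cassandra's actual repair classes.
import java.util.List;

final class ProposedRepairFlow
{
    interface Endpoint
    {
        void snapshot(String columnFamily);
        void requestTree(String columnFamily, String range);
    }

    static void repair(List<String> ranges, List<String> columnFamilies, List<Endpoint> endpoints)
    {
        // Snapshot each column family once per endpoint up front:
        // e.g. 20 tables -> 20 snapshots per endpoint, instead of
        // 256 ranges x 20 tables = 5120 snapshots per endpoint.
        for (String cf : columnFamilies)
            for (Endpoint endpoint : endpoints)
                endpoint.snapshot(cf);

        // Then compute Merkle trees range by range against those snapshots.
        for (String range : ranges)
            for (String cf : columnFamilies)
                for (Endpoint endpoint : endpoints)
                    endpoint.requestTree(cf, range);
    }
}
{code}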



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
