Subject: Re: Nodetool repair takes 4+ hours for about 10G data
From: Peter Schuller
To: user@cassandra.apache.org
Date: Fri, 19 Aug 2011 20:13:12 +0200

> Is it normal that the repair takes 4+ hours for every node, with only
> about 10G data? If this is not expected, do we have any hint what could
> be causing this?

It does not seem entirely crazy, depending on the nature of your data
and how CPU-intensive it is "per byte" to compact. Assuming there is no
functional problem delaying the repair, the question is what the
bottleneck is.

If you have a lot of read traffic keeping the drives busy, compaction
may be throttling on reads from disk (even though the compaction's own
reads are sequential) because of the live reads. Otherwise you may be
CPU bound (something like htop will show fairly well whether you are
saturating a core doing compaction).

To be clear, the processes to watch for are:

* The validation compaction happening on the repairing node AND ITS
  NEIGHBORS - can be CPU or I/O bound (or throttled) - watch nodetool
  compactionstats, htop, iostat -x -k 1
* Streaming of data between nodes - can be network or disk bound
  (possibly throttled too, if streaming throttling is in the version
  you're running) - watch nodetool netstats, ifstat, iostat -x -k 1
* The sstable rebuild compaction happening after streaming, which
  builds bloom filters and indexes - can be CPU or I/O bound (or
  throttled) - watch nodetool compactionstats, htop, iostat -x -k 1
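In practice, this is roughly the sequence I would run to narrow it
down. A sketch only: node1.example.com is a placeholder (run the same
checks on the neighbors too), and the exact compactionstats output
varies a bit between versions.

  # 1. Validation compaction (on the repairing node AND its neighbors):
  nodetool -h node1.example.com compactionstats
  htop                 # one core pegged at 100% suggests CPU bound
  iostat -x -k 1       # sustained high %util on the data disk suggests I/O bound

  # 2. Streaming between nodes:
  nodetool -h node1.example.com netstats
  ifstat 1             # NIC near line rate suggests network bound
  iostat -x -k 1

  # 3. Post-streaming sstable rebuild (shows up in compactionstats again):
  nodetool -h node1.example.com compactionstats

-- 
/ Peter Schuller (@scode on twitter)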