Subject: Re: Nodetool repair takes 4+ hours for about 10G data
From: Peter Schuller
To: user@cassandra.apache.org
Date: Fri, 19 Aug 2011 20:13:12 +0200

> Is it normal that the repair takes 4+ hours for every node, with only
> about 10G data? If this is not expected, do we have any hint what could
> be causing this?

It does not seem entirely crazy, depending on the nature of your data
and how CPU-intensive it is "per byte" to compact. Assuming there is no
functional problem delaying the repair, the question is what the
bottleneck is.

If you have a lot of read traffic keeping the drives busy, compaction
may be throttling on reads from disk (even though the compaction's own
reads are sequential) because of the live reads. Otherwise you may be
CPU bound (something like htop will show fairly well whether you are
saturating a core doing compaction).

To be clear, the processes to watch for are:

* The validation compaction happening on the repairing node AND ITS
  NEIGHBORS - can be CPU or I/O bound (or throttled) - watch nodetool
  compactionstats, htop, iostat -x -k 1
* Streaming of data between nodes - can be network or disk bound
  (possibly throttled too, if streaming throttling is in the version
  you're running) - watch nodetool netstats, ifstat, iostat -x -k 1
* The sstable rebuild compaction happening after streaming, which
  builds bloom filters and indexes - can be CPU or I/O bound (or
  throttled) - watch nodetool compactionstats, htop, iostat -x -k 1
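In practice, this is roughly the sequence I would run to narrow it
down. A sketch only: node1.example.com is a placeholder (run the same
checks on the neighbors too), and the exact compactionstats output
varies a bit between versions.

  # 1. Validation compaction (on the repairing node AND its neighbors):
  nodetool -h node1.example.com compactionstats
  htop                 # one core pegged at 100% suggests CPU bound
  iostat -x -k 1       # sustained high %util on the data disk suggests I/O bound

  # 2. Streaming between nodes:
  nodetool -h node1.example.com netstats
  ifstat 1             # NIC near line rate suggests network bound
  iostat -x -k 1

  # 3. Post-streaming sstable rebuild (shows up in compactionstats again):
  nodetool -h node1.example.com compactionstats

-- 
/ Peter Schuller (@scode on twitter)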