From: Daniel Doubleday
To: user@cassandra.apache.org
Subject: Alternative to repair
Date: Mon, 7 Mar 2011 18:18:18 +0100

Hi all

we're still on 0.6 and
are facing problems with repairs. A repair for one CF takes around 60h and we have to do that twice (RF=3, 5 nodes). During that time the cluster is under pretty heavy IO load. It kinda works, but during peak times we see lots of dropped messages (including writes). So we are actually creating inconsistencies that we are trying to fix with the repair.

Since we already have a very simple hadoopish framework in place which allows us to do token range walks with multiple workers and restart at a given position in case of failure, I created a simple worker that reads everything with CL_ALL. With only one worker and almost no performance impact, one scan took 7h.

My understanding is that at that point, due to read repair, I got the same result as I would have achieved with repair runs.

Is that true or am I missing something?

Cheers, Daniel
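In case it helps to see the idea concretely, the worker loop is roughly the following sketch (Python). The `read_keys_after` function stands in for the real range read issued at consistency level ALL; here it is stubbed with an in-memory key list so the restart logic is runnable on its own. All names and the checkpoint format are illustrative, not the actual code.

```python
# Sketch of a resumable range walker. read_keys_after() stands in for
# the real range read at CL_ALL; it is stubbed with an in-memory
# dataset so the checkpoint/restart loop can run standalone.

BATCH_SIZE = 3

# Stand-in for the column family's keys, in ring/scan order.
DATA = ["key%02d" % i for i in range(10)]

def read_keys_after(start_key, count):
    """Return up to `count` keys after `start_key` (exclusive).

    In the real worker this would be a range read performed at
    CL_ALL, which forces read repair on any out-of-date replica.
    """
    if start_key is None:
        return DATA[:count]
    idx = DATA.index(start_key) + 1
    return DATA[idx:idx + count]

def scan(checkpoint=None):
    """Walk the whole range, yielding a checkpoint after each batch.

    If the worker dies, restart from the last persisted checkpoint
    instead of rescanning from the beginning of the range.
    """
    last = checkpoint
    while True:
        batch = read_keys_after(last, BATCH_SIZE)
        if not batch:
            return
        # ... real code would process/verify the CL_ALL read here ...
        last = batch[-1]
        yield last  # persist this as the restart position

if __name__ == "__main__":
    print(list(scan()))          # full scan from the start
    print(list(scan("key05")))   # resumed scan after a failure
```

The point of yielding after every batch is that a crashed worker loses at most one batch of progress, which is what makes a 7h scan practical to restart.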