From: Daniel Doubleday
To: user@cassandra.apache.org
Subject: Alternative to repair
Date: Mon, 7 Mar 2011 18:18:18 +0100

Hi all

we're still on 0.6 and
are facing problems with repairs. A repair for one CF takes around 60h and we have to do that twice (RF=3, 5 nodes). During that time the cluster is under pretty heavy IO load. It kinda works, but during peak times we see lots of dropped messages (including writes). So we are actually creating inconsistencies that we are trying to fix with the repair.

Since we already have a very simple hadoopish framework in place which allows us to do token range walks with multiple workers and restart at a given position in case of failure, I created a simple worker that reads everything with CL_ALL. With only one worker and almost no performance impact, one scan took 7h.

My understanding is that at that point, due to read repair, I got the same result as I would have achieved with repair runs.

Is that true or am I missing something?

Cheers, Daniel
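In case it helps to see the idea concretely, the worker loop is roughly the following sketch (Python). The `read_keys_after` function stands in for the real range read issued at consistency level ALL; here it is stubbed with an in-memory key list so the restart logic is runnable on its own. All names and the checkpoint format are illustrative, not the actual code.

```python
# Sketch of a resumable range walker. read_keys_after() stands in for
# the real range read at CL_ALL; it is stubbed with an in-memory
# dataset so the checkpoint/restart loop can run standalone.

BATCH_SIZE = 3

# Stand-in for the column family's keys, in ring/scan order.
DATA = ["key%02d" % i for i in range(10)]

def read_keys_after(start_key, count):
    """Return up to `count` keys after `start_key` (exclusive).

    In the real worker this would be a range read performed at
    CL_ALL, which forces read repair on any out-of-date replica.
    """
    if start_key is None:
        return DATA[:count]
    idx = DATA.index(start_key) + 1
    return DATA[idx:idx + count]

def scan(checkpoint=None):
    """Walk the whole range, yielding a checkpoint after each batch.

    If the worker dies, restart from the last persisted checkpoint
    instead of rescanning from the beginning of the range.
    """
    last = checkpoint
    while True:
        batch = read_keys_after(last, BATCH_SIZE)
        if not batch:
            return
        # ... real code would process/verify the CL_ALL read here ...
        last = batch[-1]
        yield last  # persist this as the restart position

if __name__ == "__main__":
    print(list(scan()))          # full scan from the start
    print(list(scan("key05")))   # resumed scan after a failure
```

The point of yielding after every batch is that a crashed worker loses at most one batch of progress, which is what makes a 7h scan practical to restart.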