From: Aaron Morton
Subject: Re: node repair eat up all disk io and slow down entire cluster(3 nodes)
Date: Thu, 21 Jul 2011 11:56:04 +1200
To: "user@cassandra.apache.org"

If you have never run repair, also check the section on repair on this page:
http://wiki.apache.org/cassandra/Operations
It covers how frequently repair should be run.

There is an issue where repair can stream too much data, and this can lead to excessive disk use.

My non-scientific approach to the "never run repair before" problem is to repair a single CF at a time, starting with the small ones. They are less likely to have differences, so they will stream the smallest amount of data.

If you really want to conserve disk IO during the repair, consider disabling minor compaction by setting the min and max thresholds to 0 via nodetool.

Hope that helps.
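
A minimal sketch of the one-CF-at-a-time approach, in case it helps. It only builds the nodetool command lines and prints them unless you turn off the dry run. The keyspace name, column family names and sizes are made up for illustration, and the argument order for `repair` and `setcompactionthreshold` is what I would expect on 0.7, so check it against the usage output of nodetool on your version before running anything.

```python
import subprocess


def repair_commands(host, keyspace, cf_sizes):
    """Build one `nodetool repair` command per column family, smallest first.

    cf_sizes maps a column family name to its on-disk size in bytes
    (hypothetical input; gather it however suits your setup).
    """
    ordered = sorted(cf_sizes, key=lambda cf: (cf_sizes[cf], cf))
    return [["nodetool", "-h", host, "repair", keyspace, cf] for cf in ordered]


def compaction_threshold_command(host, keyspace, cf, min_threshold, max_threshold):
    """Build a `nodetool setcompactionthreshold` command.

    Passing 0 and 0 is the "disable minor compaction" setting from the text.
    The argument order is an assumption; confirm it for your version.
    """
    return ["nodetool", "-h", host, "setcompactionthreshold",
            keyspace, cf, str(min_threshold), str(max_threshold)]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


if __name__ == "__main__":
    # Hypothetical keyspace, column families and sizes, for illustration only.
    sizes = {"Users": 2 * 1024**3, "Sessions": 200 * 1024**2, "Events": 15 * 1024**3}
    run_all(repair_commands("localhost", "MyKeyspace", sizes))
```

If you do set the thresholds to 0 for the duration of the repair, remember to put them back to their previous values afterwards, otherwise minor compaction stays off.
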
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 20/07/2011, at 11:46 PM, Yan Chunlu wrote:

> Just found this:
> https://issues.apache.org/jira/browse/CASSANDRA-2156
>
> But it seems to be available only for 0.8, and people submitted a patch for 0.6. I am using 0.7.4; do I need to dig into the code and make my own patch?
>
> Would adding compaction throttling solve the IO problem? Thanks!
>
> On Wed, Jul 20, 2011 at 4:44 PM, Yan Chunlu wrote:
> When I started using Cassandra, I had no idea that I should run "node repair" frequently. So basically, I have 3 nodes with RF=3 and have not run node repair for months. The data size is 20G.
>
> The problem is that when I start running node repair now, it eats up all the disk IO and the server load climbs to 20+ and keeps increasing. Worst of all, the entire cluster slows down and cannot handle requests, so I have to stop it immediately because it makes my web service unavailable.
>
> The server has an Intel Xeon-Lynnfield 3470-Quadcore [2.93GHz] and 8G memory, with a Western Digital WD RE3 WD1002FBYS SATA disk.
>
> I really have no idea what to do now, as I have already found some data loss. Any suggestions would be appreciated.
>
> --
> 闫春路