Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of springrider@gmail.com
 designates 209.85.215.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: 
 <CAKkz8Q3he=hjN-AbUBhJ0KmtAGOQymm+5gCCwpZ3ekAhJVViSA@mail.gmail.com>
References: 
 <CAOA66tG+suc2J20czNMt50z-EHAS-u4Bj4aD1HsjjaRT7Koyag@mail.gmail.com>
 <CAKkz8Q2zgQCt0nTo4Bk337LcVvKK+w_Ka9FH8a_oiVKSPcNOog@mail.gmail.com>
 <CAOA66tEn76Uc8Sce5oPo+tv2rsNkpaM18UV_K+VJ5vDvVO5_xA@mail.gmail.com>
 <CAO5xsd0XpMhn7JjPq53kG3wgDo1ifcrvNFQ4qHasKQWSasNvGw@mail.gmail.com>
 <CAOA66tHpH66i3pUjTxvvr1gLKc4u+j=dmnREany6dK0gdqj1Bw@mail.gmail.com>
 <CAO5xsd3f29SUbMxrzNGz1os11SqQo-bpymAij2kVuhDL-cSSjw@mail.gmail.com>
 <CAOA66tE3tZdOnjQxiqZxv8GdZtLzANQBd9WwGNmeKbyOsPu+iQ@mail.gmail.com>
 <CAKkz8Q1gjm+Zv1r6kgtSP53wNPa4vArB_7UD=qqkXxk25GFn4w@mail.gmail.com>
 <CAOA66tGn8UpTLSB5p6Tw7UCKuiRWk_LDgD0Oit0hdAd5YR8=0Q@mail.gmail.com>
 <CAKkz8Q3he=hjN-AbUBhJ0KmtAGOQymm+5gCCwpZ3ekAhJVViSA@mail.gmail.com>
From: Yan Chunlu <springrider@gmail.com>
Date: Wed, 14 Sep 2011 16:53:32 +0800
Message-ID: 
 <CAOA66tGopVfDd8a0+ioBL=o8F2fsN64qr-q_C8W00rPRj4=gGw@mail.gmail.com>
Subject: Re: what's the difference between repair CF separately and repair the
 entire node?
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0015174c3ffe3e6e9104ace2e64e

--0015174c3ffe3e6e9104ace2e64e
Content-Type: text/plain; charset=ISO-8859-1

Thanks a lot for the help!

I have read the post and think 0.8 might be good enough for me, especially
0.8.5.

Also, changing gc_grace_seconds is an acceptable solution.
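
To make the gc_grace_seconds idea concrete for myself, here is a minimal
sketch of the arithmetic, assuming the usual rule that every node should be
repaired at least once within gc_grace_seconds so that deletes are not
resurrected. The function name and the 28-day schedule are my own
illustration, not anything from Cassandra itself, and the 864000 default
should be checked against your version.

```python
from datetime import timedelta


def repair_schedule_is_safe(gc_grace_seconds: int, repair_interval: timedelta) -> bool:
    """Return True if repair runs strictly more often than gc_grace_seconds.

    Hypothetical helper: it only compares two durations and does not talk
    to a cluster.
    """
    return repair_interval.total_seconds() < gc_grace_seconds


# "Say a month": 30 days expressed in seconds.
ONE_MONTH_SECONDS = int(timedelta(days=30).total_seconds())

# Assumption: 864000 (10 days) is the default gc_grace_seconds.
DEFAULT_GC_GRACE_SECONDS = 864000

if __name__ == "__main__":
    # Repairing every 28 days fits inside a one-month gc_grace_seconds ...
    print(repair_schedule_is_safe(ONE_MONTH_SECONDS, timedelta(days=28)))
    # ... but not inside the assumed 10-day default.
    print(repair_schedule_is_safe(DEFAULT_GC_GRACE_SECONDS, timedelta(days=28)))
```

The trade-off is that tombstones are kept on disk for the whole window, so
this only makes sense for workloads with few deletes, as Sylvain says below.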


On Wed, Sep 14, 2011 at 4:03 PM, Sylvain Lebresne <sylvain@datastax.com> wrote:

> On Wed, Sep 14, 2011 at 9:27 AM, Yan Chunlu <springrider@gmail.com> wrote:
> > is 0.8 ready for production use?
>
> some related discussion here:
> http://www.mail-archive.com/user@cassandra.apache.org/msg17055.html
> but my personal answer is yes.
>
> > As far as I know, many companies, including reddit.com, are currently
> > using 0.7. How do they get rid of the repair problem?
>
> Repair problems in 0.7 don't hit everyone equally. For some people, it works
> relatively well, even if not in the most efficient way. Also, for some
> workloads (if you don't do many deletes, for instance), you can set a big
> gc_grace_seconds value (say a month) and only run repair that often, which
> can make repair inefficiencies more bearable.
> That being said, I can't speak for "many companies", but I do advise
> evaluating an upgrade to 0.8.
>
> --
> Sylvain
>
> >
> > On Wed, Sep 14, 2011 at 2:47 PM, Sylvain Lebresne <sylvain@datastax.com>
> > wrote:
> >>
> >> On Wed, Sep 14, 2011 at 2:38 AM, Yan Chunlu <springrider@gmail.com> wrote:
> >> > I don't want to repair one CF at a time either.
> >> > The "node repair" ran for a week and was still running; compactionstats and
> >> > netstream showed nothing running on any node, and there was no error
> >> > message or exception. I really have no idea what it was doing.
> >>
> >> To add to the list of things repair does wrong in 0.7, we'll have to add
> >> that if one of the nodes participating in the repair (so any node that
> >> shares a range with the node on which repair was started) goes down (even
> >> for a short time), then the repair will simply hang forever doing nothing.
> >> And no specific error message will be logged. That could be what happened.
> >> Again, recent releases of 0.8 fix that too.
> >>
> >> --
> >> Sylvain
> >>
> >> > I stopped it yesterday. Maybe I should run repair again with
> >> > compaction disabled on all nodes?
> >> > thanks!
> >> >
> >> > On Wed, Sep 14, 2011 at 6:57 AM, Peter Schuller
> >> > <peter.schuller@infidyne.com> wrote:
> >> >>
> >> >> > I think it is a serious problem since I cannot "repair"... I am
> >> >> > using Cassandra on production servers. Is there some way to fix it
> >> >> > without upgrading? I heard that 0.8.x is still not quite ready for
> >> >> > production environments.
> >> >>
> >> >> It is a serious issue if you really need to repair one CF at a time.
> >> >> However, looking at your original post it seems this is not
> >> >> necessarily your issue. Do you need to, or was your concern rather the
> >> >> overall time repair took?
> >> >>
> >> >> There are other things that are improved in 0.8 that affect 0.7. In
> >> >> particular, (1) in 0.7 compaction, including the validating compactions
> >> >> that are part of repair, is non-concurrent, so if your repair starts
> >> >> while there is a long-running compaction going it will have to wait,
> >> >> and (2) semi-related is that the Merkle tree calculation that is part
> >> >> of repair/anti-entropy may happen "out of sync" if one of the nodes
> >> >> participating happens to be busy with compaction. This in turn causes
> >> >> additional data to be sent as part of repair.
> >> >>
> >> >> That might be why your immediately following repair took a long time,
> >> >> but it's difficult to tell.
> >> >>
> >> >> If you're having issues with repair and large data sets, I would
> >> >> generally say that upgrading to 0.8 is recommended. However, if you're
> >> >> on 0.7.4, beware of
> >> >> https://issues.apache.org/jira/browse/CASSANDRA-3166
> >> >>
> >> >> --
> >> >> / Peter Schuller (@scode on twitter)
> >> >
> >> >
> >
> >
>
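
P.S. To check my understanding of Peter's point (2) about the Merkle tree
calculation happening "out of sync", here is a rough sketch. It is not
Cassandra's implementation: it only compares one hash per token range (the
leaf level of a Merkle tree), and the token space, range count, and row data
are all made up for illustration. It shows that if one replica hashes its
data later than the other, a write that arrived on both replicas in between
still makes the range look different, so it would be streamed needlessly.

```python
import hashlib
from typing import Dict, List


def range_hashes(rows: Dict[int, str], num_ranges: int, token_space: int) -> List[str]:
    """Hash the rows that fall into each of num_ranges equal token ranges."""
    width = token_space // num_ranges
    buckets = [hashlib.md5() for _ in range(num_ranges)]
    for token in sorted(rows):
        idx = min(token // width, num_ranges - 1)
        buckets[idx].update(f"{token}={rows[token]};".encode())
    return [b.hexdigest() for b in buckets]


def mismatched_ranges(a: List[str], b: List[str]) -> List[int]:
    """Return the indexes of ranges whose hashes differ between two replicas."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


TOKEN_SPACE = 100  # hypothetical, tiny token space
NUM_RANGES = 4

replica_a = {5: "v1", 30: "v1", 60: "v1", 90: "v1"}
replica_b = dict(replica_a)

# Both replicas hash at the same moment: nothing to stream.
in_sync = mismatched_ranges(
    range_hashes(replica_a, NUM_RANGES, TOKEN_SPACE),
    range_hashes(replica_b, NUM_RANGES, TOKEN_SPACE),
)

# Replica A hashes first; B is busy (e.g. compacting) and hashes later.
hashes_a = range_hashes(replica_a, NUM_RANGES, TOKEN_SPACE)
replica_a[30] = "v2"  # a normal write reaches both replicas in between
replica_b[30] = "v2"
hashes_b = range_hashes(replica_b, NUM_RANGES, TOKEN_SPACE)
out_of_sync = mismatched_ranges(hashes_a, hashes_b)

if __name__ == "__main__":
    print("in sync:", in_sync)
    print("out of sync:", out_of_sync)
```

Both replicas hold identical data at the end, yet the range containing token
30 is reported as different only because the hashes were taken at different
times.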

--0015174c3ffe3e6e9104ace2e64e--