From: Alain RODRIGUEZ <arodrime@gmail.com>
Date: Thu, 16 May 2013 13:49:08 +0200
Subject: Re: (unofficial) Community Poll for Production Operators : Repair
To: user@cassandra.apache.org

@Rob: Thanks for the feedback.

Yet I still see behavior around repair that I cannot explain. Are counters
supposed to be "repaired" too? While reading at CL.ONE I can get different
values depending on which node answers, even after a read repair or a full
repair. Shouldn't a repair fix these discrepancies?

The only way I have found to always get the same count is to read at
CL.QUORUM, but that is only a workaround, since the data itself remains
wrong on some nodes.
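
For what it's worth, here is roughly how I compare the two reads from
cqlsh (a minimal sketch; the keyspace, table and column names are invented
for the example):

    cqlsh> CONSISTENCY ONE;
    cqlsh> SELECT hits FROM stats.page_counters WHERE page_id = 'home';
    cqlsh> CONSISTENCY QUORUM;
    cqlsh> SELECT hits FROM stats.page_counters WHERE page_id = 'home';

Assuming RF=3 for illustration, the QUORUM read has to hear from two
replicas, so a single divergent replica can no longer win the read on its
own, which matches the count only looking stable at QUORUM.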

Any clue on this?

Alain

2013/5/15 Edward Capriolo <edlinuxguru@gmail.com>
> http://basho.com/introducing-riak-1-3/
>
> Introduced Active Anti-Entropy. Riak now has active anti-entropy. In
> distributed systems, inconsistencies can arise between replicas due to
> failure modes, concurrent updates, and physical data loss or corruption.
> Pre-1.3 Riak already had several features for repairing this "entropy",
> but they all required some form of user intervention. Riak 1.3 introduces
> automatic, self-healing properties that repair entropy on an ongoing basis.


> On Wed, May 15, 2013 at 5:32 PM, Robert Coli <rcoli@eventbrite.com> wrote:
>> On Wed, May 15, 2013 at 1:27 AM, Alain RODRIGUEZ <arodrime@gmail.com> wrote:
>> > Rob, I was wondering something. Are you a committer working on
>> > improving the repair or something similar?
>>
>> I am not a committer [1], but I have an active interest in potential
>> improvements to the best practices for repair. The specific change
>> that I am considering is a modification to the default
>> gc_grace_seconds value, which seems picked out of a hat at 10 days. My
>> view is that the current implementation of repair has such negative
>> performance consequences that I do not believe that holding onto
>> tombstones for longer than 10 days could possibly be as bad as the
>> fixed cost of running repair once every 10 days. I believe that this
>> value is too low for a default (it also does not map cleanly to the
>> work week!) and likely should be increased to 14, 21 or 28 days.
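>>
>> Overriding the default on an existing table is a one-line ALTER (a
>> sketch only; the keyspace and table names are invented, and 2419200
>> seconds is 28 * 86400):
>>
>>     cqlsh> ALTER TABLE my_keyspace.my_table
>>        ... WITH gc_grace_seconds = 2419200;
>>
>> The usual operational constraint still applies: every node has to
>> complete nodetool repair -pr at least once within whatever window you
>> pick, or tombstones can be collected before they propagate and deleted
>> data can come back.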

>> > Anyway, if a committer (or any other expert) could give us some
>> > feedback on our comments (Are we doing well or not, whether things we
>> > observe are normal or unexplained, what is going to be improved in
>> > the future about repair...)

>> 1) you are doing things according to best practice
>> 2) unfortunately your experience with significantly degraded
>> performance, including a blocked go-live due to repair bloat, is
>> pretty typical
>> 3) the things you are experiencing are part of the current
>> implementation of repair and are also typical, however I do not
>> believe they are fully "explained" [2]
>> 4) as has been mentioned further down thread, there are discussions
>> regarding (and some already committed) improvements to both the
>> current repair paradigm and an evolution to a new paradigm

>> Thanks to all for the responses so far, please keep them coming! :D
>>
>> =Rob
>> [1] hence the (unofficial) tag for this thread. I do have minor
>> patches accepted to the codebase, but always merged by an actual
>> committer. :)
>> [2] driftx@#cassandra feels that these things are explained/understood
>> by core team, and points to
>> https://issues.apache.org/jira/browse/CASSANDRA-5280 as a useful
>> approach to minimize same.

