Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: domain of springrider@gmail.com
 designates 209.85.215.172 as permitted sender)
MIME-Version: 1.0
In-Reply-To: <45543932-E5DF-48BD-952D-F99EA0AA9EC1@thelastpickle.com>
References: <1310143134.5666.1.camel@Avalon>
 <137920FE-1CF4-42E9-950E-6B7544B0662D@thelastpickle.com>
 <1310189845.1935.1.camel@Avalon>
 <CAO5xsd2KoT==b9jYqR2epMXJLMNZXs3cSS_LffM9=14wkg3cgA@mail.gmail.com>
 <CAOA66tEYS_O6Y_rPZ9jhNZvvfijLthFGd-Prvq+x0MMsG1c39Q@mail.gmail.com>
 <45543932-E5DF-48BD-952D-F99EA0AA9EC1@thelastpickle.com>
From: Yan Chunlu <springrider@gmail.com>
Date: Mon, 11 Jul 2011 10:31:04 +0800
Message-ID: 
 <CAOA66tFTHSbdgZbi60yXr6By7sEszm_NDWxTkUE7pPBrBK_RDA@mail.gmail.com>
Subject: Re: Corrupted data
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=00504502d234beb07c04a7c1fad7

--00504502d234beb07c04a7c1fad7
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

Oh, the error seems to come from JMX.

Sorry, but I don't seem to have any more error messages. The node repair
just never ends, and running strace on the process shows nothing; it is
not doing anything.

Is there any way to get more information about this? Do I need to run a
major compaction on every column family? Thanks!

On Mon, Jul 11, 2011 at 1:36 AM, aaron morton <aaron@thelastpickle.com> wrote:

> 1) Do I need to treat every node as failed and do a rolling replacement,
> since there might be inconsistencies in the cluster that I have no way to
> find out about?
>
> see
> http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
>
> 2) Is that the reason the node repair hung? The log message says:
> Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run
> WARNING: Failed to check the connection: java.net.SocketTimeoutException:
> Read timed out
>
> I cannot find that anywhere in the code base. Can you provide some more
> information?
>
> Cheers
>
>  -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10 Jul 2011, at 03:26, Yan Chunlu wrote:
>
> I am running RF=2 (I changed it from 2 to 3 and back to 2) on 3 nodes and
> had not run node repair for more than 10 days; I was not aware that this
> is critical. I ran node repair recently and one of the nodes always hangs.
> From the log, it seems to be doing nothing related to the repair.
>
> So I have two problems:
>
> 1) Do I need to treat every node as failed and do a rolling replacement,
> since there might be inconsistencies in the cluster that I have no way to
> find out about?
> 2) Is that the reason the node repair hung? The log message says:
> Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run
> WARNING: Failed to check the connection: java.net.SocketTimeoutException:
> Read timed out
>
> then nothing.
>
> thanks!
>
> On Sat, Jul 9, 2011 at 10:16 PM, Peter Schuller <
> peter.schuller@infidyne.com> wrote:
>
>> >> - Have you been running repair consistently ?
>> >
>> > Nop, only when something breaks
>>
>> This is unrelated to the problem you were asking about, but if you
>> never run delete, make sure you are aware of:
>>
>> http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>> http://wiki.apache.org/cassandra/DistributedDeletes
>>
>>
>> --
>> / Peter Schuller
>>
>
>
>
> --
> 闫春路
>
>
>


--
Charles

--00504502d234beb07c04a7c1fad7--