From: Yan Chunlu
Date: Mon, 11 Jul 2011 10:31:04 +0800
Subject: Re: Corrupted data
To: user@cassandra.apache.org

Oh, the error seems to come from JMX.

Sorry, but it seems I don't have any more error messages. The node repair just never ends, and running strace on the process finds nothing; it is not doing anything.

Is there any way to get more information about this? Do I need to do a major compaction on every column family? Thanks!

On Mon, Jul 11, 2011 at 1:36 AM, aaron morton wrote:

> 1) Do I need to treat every node as failed and do a rolling replacement,
> since there might be some inconsistency in the cluster that I have no way
> to find out?
>
> see
> http://wiki.apache.org/cassandra/Operations#Dealing_with_the_consequences_of_nodetool_repair_not_running_within_GCGraceSeconds
>
> 2) Is that the reason the node repair hung? The log message says:
> Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run
> WARNING: Failed to check the connection: java.net.SocketTimeoutException:
> Read timed out
>
> I cannot find that anywhere in the code base, can you provide some more
> information?
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 10 Jul 2011, at 03:26, Yan Chunlu wrote:
>
> I am running RF=2 (I changed it from 2 to 3 and back to 2) on 3 nodes and
> have not run node repair for more than 10 days; I was not aware that this
> is critical. I ran node repair recently and one of the nodes always
> hangs... From the log it seems to be doing nothing related to the repair.
>
> So I have two problems:
>
> 1) Do I need to treat every node as failed and do a rolling replacement,
> since there might be some inconsistency in the cluster that I have no way
> to find out?
> 2) Is that the reason the node repair hung? The log message says:
> Jul 10, 2011 4:40:35 AM ClientCommunicatorAdmin Checker-run
> WARNING: Failed to check the connection: java.net.SocketTimeoutException:
> Read timed out
>
> then nothing.
>
> thanks!
>
> On Sat, Jul 9, 2011 at 10:16 PM, Peter Schuller <peter.schuller@infidyne.com> wrote:
>
>> >> - Have you been running repair consistently ?
>> >
>> > Nop, only when something breaks
>>
>> This is unrelated to the problem you were asking about, but if you
>> never run delete, make sure you are aware of:
>>
>> http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
>> http://wiki.apache.org/cassandra/DistributedDeletes
>>
>> --
>> / Peter Schuller
>
> --
> 闫春路

--
Charles

