Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
MIME-Version: 1.0
In-Reply-To: <AANLkTimm6P8vWXfDjpFZh9mMUxP_SGGm5i-LM7vRuwQ4@mail.gmail.com>
References: <AANLkTinor1L-7XkVd2K57kX6kEYKXwcS8Tf3zHL==bQU@mail.gmail.com>
	<AANLkTimm6P8vWXfDjpFZh9mMUxP_SGGm5i-LM7vRuwQ4@mail.gmail.com>
Date: Mon, 20 Sep 2010 20:04:36 +0200
Message-ID: <AANLkTi=S+UCv8j-ncB_Joo3yHHh5LwU=hve+mR+Pw7Mx@mail.gmail.com>
Subject: Re: FatClient Gossip error and some other problems
From: shimi <shimi.k@gmail.com>
To: user@cassandra.apache.org
Content-Type: multipart/alternative; boundary=0016364581b0cc8da00490b4be13

--0016364581b0cc8da00490b4be13
Content-Type: text/plain; charset=ISO-8859-1

I was patient (although it is hard when you have millions of requests which
are not served in time). I waited for a long time. There was nothing in the
logs or in JMX.

Shimi

On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek <gdusbabek@gmail.com> wrote:

> On Mon, Sep 20, 2010 at 09:51, shimi <shimi.k@gmail.com> wrote:
> > I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
> > I replaced all of the servers in the cluster (0.6.4) with new ones
> > (0.6.5).
> > My old cluster was unbalanced since I was using Random Partitioner and I
> > bootstrapped all the nodes without specifying their tokens.
> >
> > Since I wanted the cluster to be balanced, I first added all the new
> > nodes one after the other (with the right tokens this time) and then I
> > ran decommission on all the old ones, one after the other.
> > One of the decommissioning nodes began throwing "too many open files"
> > errors while it was decommissioning, taking other nodes down with it.
> > After the second try I decided to stop it and run removetoken on its
> > token from one of the other nodes. After that everything went well,
> > except that in the end one of the nodes looked unbalanced.
> >
> > I decided to run repair on the cluster. What I got was totally
> > unbalanced nodes with far more data than they were supposed to have:
> > each node had 2x-4x more data.
> > I ran cleanup, and all of them except the one that was unbalanced to
> > begin with got back to the size they were supposed to be.
> > Now whenever I try to run cleanup on this node I get:
> >
> >  INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java (line 339) AntiCompacting ...
> >  INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used; max is 6552551424
> >  INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) Pool Name                    Active   Pending
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) STREAM-STAGE                      0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) RESPONSE-STAGE                    0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) ROW-READ-STAGE                    8       717
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) LB-OPERATIONS                     0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) MISCELLANEOUS-POOL                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) GMFD                              0         2
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) CONSISTENCY-MANAGER               0         1
> >  INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) LB-TARGET                         0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) ROW-MUTATION-STAGE                0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) MESSAGE-STREAMING-POOL            0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) LOAD-BALANCER-STAGE               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) FLUSH-SORTER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER             0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) AE-SERVICE-STAGE                  0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) FLUSH-WRITER-POOL                 0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) HINTED-HANDOFF-POOL               0         0
> >  INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) CompactionManager               n/a         0
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 SSTableDeletingReference.java (line 104) Deleted ...
> >  INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 SSTableDeletingReference.java (line 104) Deleted ...
> >
> > After that I saw an increase in the node's response time and in the
> > number of ROW-READ-STAGE pending tasks. Since there was no indication
> > that something was wrong or that the node was doing anything (in the
> > logs, nodetool, or JMX), the only thing I could do was restart the
> > server.
> >
> > I don't know if this is related, but every hour I see this error (I
> > think it is the IP of the machine that I couldn't decommission
> > properly):
> >
> >  INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient /X.X.X.X has been silent for 3600000ms, removing from gossip
> > ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
> > java.util.ConcurrentModificationException
> >     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
> >     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
> >     at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
> >     at java.util.TimerThread.mainLoop(Timer.java:512)
> >     at java.util.TimerThread.run(Timer.java:462)
> >  INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node /X.X.X.X is now part of the cluster
> >
> > Does anyone have any idea how I can clean up the problematic node?
>
> You may just need to be patient. Have you tried monitoring the
> CompactionManager in JMX to see if it is doing things?
>
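For anyone following the thread, here is a minimal sketch of what polling the CompactionManager over JMX could look like from Java. The host, the port (8080 here), and the MBean name pattern org.apache.cassandra.db:type=CompactionManager are assumptions for illustration and should be checked against your own node (for example with jconsole). The helper dumps every readable attribute of whatever MBeans match the pattern, so it does not depend on any particular attribute names.

```java
import java.lang.management.ManagementFactory;
import java.util.Map;
import java.util.TreeMap;
import javax.management.MBeanAttributeInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

final class CompactionWatchSketch {

    // Reads every readable attribute of all MBeans matching namePattern.
    // Keys look like "<object name>/<attribute name>".
    static Map<String, Object> readAttributes(MBeanServerConnection conn, String namePattern) {
        Map<String, Object> out = new TreeMap<>();
        try {
            for (ObjectName name : conn.queryNames(new ObjectName(namePattern), null)) {
                for (MBeanAttributeInfo attr : conn.getMBeanInfo(name).getAttributes()) {
                    if (!attr.isReadable()) {
                        continue;
                    }
                    String key = name + "/" + attr.getName();
                    try {
                        out.put(key, conn.getAttribute(name, attr.getName()));
                    } catch (Exception e) {
                        // Some attributes cannot be read; record that instead of failing.
                        out.put(key, "unavailable: " + e);
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalStateException("JMX query failed for " + namePattern, e);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            // No node given: demonstrate against this JVM's own MBean server.
            readAttributes(ManagementFactory.getPlatformMBeanServer(), "java.lang:type=Runtime")
                    .forEach((k, v) -> System.out.println(k + " = " + v));
            return;
        }
        // Assumed URL layout for a node exposing JMX over RMI on host:port.
        String url = "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi";
        // Assumed MBean name; verify it on your node before relying on it.
        String pattern = "org.apache.cassandra.db:type=CompactionManager";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            // Poll a few times: values that change between polls suggest work in progress.
            for (int i = 0; i < 5; i++) {
                readAttributes(conn, pattern).forEach((k, v) -> System.out.println(k + " = " + v));
                System.out.println("----");
                Thread.sleep(10_000L);
            }
        }
    }
}
```

Run with a host and port as arguments (for example `java CompactionWatchSketch.java 127.0.0.1 8080`) to poll a node; with no arguments it only prints this JVM's Runtime attributes as a local smoke test.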
> > Does anyone have any idea how I can get rid of the Gossip error?
>
> This is CASSANDRA-1494. You can ignore it.
>
> Gary.
>
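For context on the stack trace quoted above: the exception comes out of Hashtable$Enumerator.next, which is what Java raises when a Hashtable is structurally modified while one of its iterators is in use. The sketch below reproduces that general pattern in isolation. It is not Cassandra's Gossiper code and it is not a description of how CASSANDRA-1494 was fixed; the class, the method names, and the endpoint table are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.Hashtable;
import java.util.List;
import java.util.Map;

final class GossipCmeSketch {

    // Hypothetical endpoint table: address -> milliseconds since last heard from.
    static Map<String, Long> endpoints() {
        Map<String, Long> state = new Hashtable<>();
        state.put("10.0.0.1", 1_000L);
        state.put("10.0.0.2", 3_600_000L);
        state.put("10.0.0.3", 2_000L);
        state.put("10.0.0.4", 3_600_000L);
        return state;
    }

    // Removes silent endpoints while iterating the live key set.
    // Returns true if the iteration failed with ConcurrentModificationException.
    static boolean unsafeSweep(Map<String, Long> state, long timeoutMs) {
        try {
            for (String endpoint : state.keySet()) {
                if (state.get(endpoint) >= timeoutMs) {
                    state.remove(endpoint); // structural change under the iterator
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterates over a snapshot of the keys, so removals cannot disturb the loop.
    static List<String> safeSweep(Map<String, Long> state, long timeoutMs) {
        List<String> removed = new ArrayList<>();
        for (String endpoint : new ArrayList<>(state.keySet())) {
            Long silentMs = state.get(endpoint);
            if (silentMs != null && silentMs >= timeoutMs) {
                state.remove(endpoint);
                removed.add(endpoint);
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        System.out.println("unsafe sweep hit CME: " + unsafeSweep(endpoints(), 3_600_000L));
        System.out.println("safe sweep removed: " + safeSweep(endpoints(), 3_600_000L));
    }
}
```

The sample table holds two stale entries on purpose: the fail-fast check runs on the next call to the iterator, so a single removal that happens to fall on the last element visited would go unnoticed.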

--0016364581b0cc8da00490b4be13--