cassandra-user mailing list archives

From Gary Dusbabek <gdusba...@gmail.com>
Subject Re: Hung Repair
Date Mon, 25 Oct 2010 13:32:08 GMT
Can you produce a thread dump on the machine?  kill -3 <pid> (SIGQUIT)
against the Cassandra JVM ought to do it; the dump is written to the
process's stdout.

JConsole can be your friend at a time like this too.  It might be
painstaking, but you can check the CPU time used by each thread using
the java.lang:type=Threading MBean.  There's an interesting JConsole plugin
that is supposed to make this easier:
http://lsd.luminis.nl/top-threads-plugin-for-jconsole/
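
If clicking through the MBean by hand gets tedious, the same numbers can
be pulled programmatically.  Below is a rough sketch, not anything that
ships with Cassandra: the class name TopThreads and the method topThreads
are made up for illustration, and the host and port passed on the command
line are whatever your node's JMX endpoint happens to be (I am assuming
the standard RMI connector URL).  It reads per-thread CPU time through the
platform ThreadMXBean and lists the busiest threads first; the thread
names should match the ones in the thread dump.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: ranks threads by cumulative CPU time.
class TopThreads {

    // Returns up to `limit` lines of "<cpu ms> ms  <thread name>", busiest first.
    // Returns an empty list if the JVM cannot report per-thread CPU time.
    static List<String> topThreads(ThreadMXBean mx, int limit) {
        List<String> out = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return out;
        }
        List<long[]> rows = new ArrayList<>();
        for (long id : mx.getAllThreadIds()) {
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if the thread has died
            if (cpuNanos >= 0) {
                rows.add(new long[] {cpuNanos, id});
            }
        }
        rows.sort((a, b) -> Long.compare(b[0], a[0])); // descending by CPU time
        for (long[] row : rows) {
            if (out.size() >= limit) {
                break;
            }
            ThreadInfo info = mx.getThreadInfo(row[1]); // null if the thread has died
            if (info != null) {
                out.add((row[0] / 1_000_000) + " ms  " + info.getThreadName());
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 2) {
            // Assumption: args are <jmx-host> <jmx-port> of the node to inspect.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + ":" + args[1] + "/jmxrmi");
            try (JMXConnector jmxc = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = jmxc.getMBeanServerConnection();
                ThreadMXBean mx = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);
                topThreads(mx, 10).forEach(System.out::println);
            }
        } else {
            // No arguments: inspect the local JVM, which is handy for a dry run.
            topThreads(ManagementFactory.getThreadMXBean(), 10).forEach(System.out::println);
        }
    }
}
```

The CPU times are cumulative since each thread started, so run it twice a
few seconds apart and compare; the threads whose numbers keep climbing are
the ones burning the CPU right now.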

Gary.


On Fri, Oct 22, 2010 at 16:42, Dan Hendry <dan.hendry.junk@gmail.com> wrote:
> I am currently running a 4 node cluster on Cassandra beta 2. Yesterday, I
> ran into a number of problems and one of my nodes went down for a few
> hours. I tried to run a nodetool repair and at least at a data level,
> everything seems to be consistent and alright. The problem is that the node
> is still chewing up 100% of its available CPU, 20 hours after I started the
> repair. Load averages are 8-9 which is crazy given it is a single core ec2
> m1.small.
>
> Besides sitting at 100% cpu, the node on which I ran the repair seems to be
> fine. The Cassandra logs appear normal. Based on bandwidth patterns between
> nodes, it does not seem like they are transferring any repair related data
> (as they did initially). No pending tasks are being shown in any of the
> services when inspecting via jmx. I have a reasonable amount of data in the
> cluster (~6 gb * 2 replication factor) but nothing crazy. The last repair
> related entry in the logs is as follows:
>
> INFO [Thread-145] 2010-10-22 00:24:10,561 AntiEntropyService.java (line 828)
> #<TreeRequest manual-repair-23dacf4b-4076-4460-abd5-a713bfd090e2,
> /10.192.227.6, (kikmetrics,PacketEventsByPacket)> completed successfully: 14
> outstanding.
>
> Any idea what is going on? Could the CPU usage STILL be related to the
> repair? Is there any way to check? I hesitate to simply kill the node given
> the “14 outstanding” log message and as doing so has caused me problems in
> the past when using beta versions.
>
> Dan Hendry
