"Dropped read message" warnings may indicate a capacity problem. We saw a similar issue with 0.7.6.

We ended up adding two extra nodes and physically rebooting the offending node(s).

The entire cluster then calmed down.

On Thu, Jul 28, 2011 at 2:24 PM, Yan Chunlu <springrider@gmail.com> wrote:
I have three nodes and RF=3. Here is the current ring:

Address  Status  State   Load      Owns    Token
node1    Up      Normal  15.32 GB  81.09%  52773518586096316348543097376923124102
node2    Up      Normal  22.51 GB  10.48%  70597222385644499881390884416714081360
node3    Up      Normal  56.1 GB   8.43%   84944475733633104818662955375549269696

It is very unbalanced and I would like to rebalance it using
"nodetool move" as soon as possible. Unfortunately, I haven't run
node repair for a long time.

Aaron suggested that it's better to run node repair on every node first and then rebalance.
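
For reference, the Owns column and a balanced target layout can be checked with a few lines of Python. This is only a minimal sketch: it assumes the cluster uses RandomPartitioner (token range 0 to 2**127), which the size of the tokens in the ring output suggests but which is not confirmed here, and the function names are made up for illustration.

```python
# Sketch only: assumes RandomPartitioner, whose token space is 0 .. 2**127.
RING_SIZE = 2 ** 127

# Tokens copied from the ring output above.
CURRENT_TOKENS = {
    "node1": 52773518586096316348543097376923124102,
    "node2": 70597222385644499881390884416714081360,
    "node3": 84944475733633104818662955375549269696,
}


def balanced_tokens(node_count):
    """Return node_count evenly spaced tokens, starting at 0."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Map each token to the percentage of the ring it owns.

    A node owns the range from the previous token (exclusive) up to its
    own token (inclusive); the lowest token wraps around to the highest.
    """
    ordered = sorted(tokens)
    shares = {}
    for idx, tok in enumerate(ordered):
        prev = ordered[idx - 1]  # idx 0 wraps to the last token
        span = (tok - prev) % RING_SIZE or RING_SIZE  # single node owns all
        shares[tok] = span / RING_SIZE * 100
    return shares


if __name__ == "__main__":
    shares = ownership(CURRENT_TOKENS.values())
    for name, tok in CURRENT_TOKENS.items():
        print(f"{name}: owns {shares[tok]:.2f}%")

    print("Evenly spaced tokens for 3 nodes:")
    for tok in balanced_tokens(3):
        print(tok)
```

The first loop should reproduce the Owns percentages shown by the ring command, which is a quick sanity check on the partitioner assumption. Each evenly spaced token would then be the argument to "nodetool move" on one node, moving one node at a time.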

The problem is that node3 is currently under heavy load, and the
entire cluster slows down if I start a node repair. I have to run
disablegossip and disablethrift to stop the repair.

Only Cassandra is running on that server and I have no idea what it is
doing. The CPU load is currently about 20 or more. compactionstats and
netstats show that it is not doing anything.

I have changed the clients so they do not connect to node3, but it
still seems to be under heavy load and I/O utilization is 100%.

The log seems normal (although I am not sure about the "Dropped read
message" entries):

 INFO 13:21:38,191 GC for ParNew: 345 ms, 627003992 reclaimed leaving 2563726360 used; max is 4248829952
 WARN 13:21:38,560 Dropped 826 READ messages in the last 5000ms
 INFO 13:21:38,560 Pool Name                    Active   Pending
 INFO 13:21:38,560 ReadStage                         8      7555
 INFO 13:21:38,561 RequestResponseStage              0         0
 INFO 13:21:38,561 ReadRepairStage                   0         0

Is there any way to tell what node3 is doing? Or at least, is there any
way to keep it from slowing down the whole cluster?

Frank Duan
c: 703.869.9951