incubator-cassandra-user mailing list archives

From: Dan Kogan <...@iqtell.com>
Subject: RE: Node went down and came back up
Date: Mon, 06 May 2013 13:20:10 GMT
It seems we did not have the JMX ports (1024+) open in our firewall.  Once we opened
ports 1024+, the hinted handoffs completed and the cluster appears to have gone back to normal.
Does that make sense?
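
In case it helps anyone else, here is a minimal sketch (the host below is a placeholder, not one of our actual addresses) of how one might check from one node whether another node's ports are reachable through the firewall.  JMX also uses ephemeral ports above 1024 for the RMI data connection, which seems to be why the whole range mattered for us:

    # Minimal port-reachability check between nodes. The host is a
    # placeholder; 7000/7199/9160 are the default storage, JMX, and
    # Thrift ports in Cassandra 1.x.
    import socket

    def port_open(host, port, timeout=2.0):
        """Return True if a TCP connection to host:port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        host = "10.0.0.2"  # placeholder: another node in the ring
        for port in (7000, 7199, 9160):
            status = "open" if port_open(host, port) else "blocked/filtered"
            print("%s:%d %s" % (host, port, status))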

Thanks,
Dan

This is what we saw in the logs after opening the ports:

INFO [HintedHandoff:1] 2013-05-05 14:52:41,925 ColumnFamilyStore.java (line 659) Enqueuing flush of Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:41,926 Memtable.java (line 264) Writing Memtable-HintsColumnFamily@726541064(33313153/41641441 serialized/live bytes, 18009 ops)
INFO [FlushWriter:4] 2013-05-05 14:52:42,961 Memtable.java (line 305) Completed flushing /data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db (33344642 bytes) for commitlog position ReplayPosition(segmentId=1367725930067, position=12449833)
INFO [CompactionExecutor:16] 2013-05-05 14:52:42,969 CompactionTask.java (line 109) Compacting [SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-10-Data.db'), SSTableReader(path='/data/cassandra/data/system/HintsColumnFamily/system-HintsColumnFamily-he-9-Data.db')]
INFO [HintedHandoff:1] 2013-05-05 14:52:43,419 HintedHandOffManager.java (line 390) Finished hinted handoff of 7945 rows to endpoint /107.20.45.6
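
To double-check that the hint backlog actually drained, a quick sketch like this (the log path is a placeholder for wherever your system.log lives) could tally the rows delivered per endpoint from the "Finished hinted handoff" lines above:

    # Sum rows delivered per endpoint from "Finished hinted handoff"
    # log lines; the log path is a placeholder.
    import re

    PATTERN = re.compile(r"Finished hinted handoff of (\d+) rows to endpoint (\S+)")

    totals = {}
    with open("/var/log/cassandra/system.log") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                rows, endpoint = int(match.group(1)), match.group(2)
                totals[endpoint] = totals.get(endpoint, 0) + rows

    for endpoint, rows in sorted(totals.items()):
        print(endpoint, rows)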


-----Original Message-----
From: Dan Kogan [mailto:dan@iqtell.com] 
Sent: Sunday, May 05, 2013 8:24 AM
To: user@cassandra.apache.org
Subject: Node went down and came back up

Hello,

Last night one of our nodes froze and the server had to be rebooted.  After it came up, the
node joined the ring and everything looked normal.
However, this morning there seem to be some inconsistencies in the data (e.g. some nodes don't
have a given record, or have a different version of the record than other nodes).

There are also a lot of messages about hinted handoff in the logs that started after the node
failure, like these:

INFO [HintedHandoff:1] 2013-05-05 11:22:23,339 HintedHandOffManager.java (line 294) Started hinted handoff for token: 56713727820156410577229101238628035242 with IP: /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,343 HintedHandOffManager.java (line 372) Timed out replaying hints to /107.20.45.6; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /107.20.45.6
INFO [HintedHandoff:1] 2013-05-05 11:22:33,344 HintedHandOffManager.java (line 294) Started hinted handoff for token: 0 with IP: /67.202.15.178
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 372) Timed out replaying hints to /67.202.15.178; aborting further deliveries
INFO [HintedHandoff:1] 2013-05-05 11:22:43,348 HintedHandOffManager.java (line 390) Finished hinted handoff of 0 rows to endpoint /67.202.15.178
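
A quick way to see which endpoints keep timing out would be to count the "Timed out replaying hints" lines, e.g. with a sketch like this (the log path is a placeholder):

    # Count "Timed out replaying hints" occurrences per endpoint;
    # the log path is a placeholder. The character class excludes the
    # trailing semicolon in the log message.
    import re
    from collections import Counter

    PATTERN = re.compile(r"Timed out replaying hints to ([^\s;]+)")

    timeouts = Counter()
    with open("/var/log/cassandra/system.log") as log:
        for line in log:
            match = PATTERN.search(line)
            if match:
                timeouts[match.group(1)] += 1

    for endpoint, count in timeouts.most_common():
        print(endpoint, count)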

Do we need to run repair on all nodes to get the cluster back to a "normal" state?

Thanks for the help.

Dan Kogan