I took the "reset the world" approach; things are much better now and the hints table is staying empty.  It's a bit disconcerting that it could get so large and not be able to recover by itself, but at least there was a solution.  Thanks

 

 

From: aaron morton [mailto:aaron@thelastpickle.com]
Sent: Thursday, March 15, 2012 7:24 PM
To: user@cassandra.apache.org
Subject: Re: Large hints column family

 

These messages make it look like the node is having trouble delivering hints. 

INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting

INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries

 

Take another look at the logs on this machine and on 20.4 and 20.3. 

 

I would be looking into why so many hints are being stored. GC? Are there also logs about dropped messages?

 

If you want to reset the world, make sure the nodes have all run repair and then drop the hints, either via JMX or by stopping the node and deleting the files on disk.
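
For the "delete the files on disk" route, it may help to review what is there before removing anything. The following is a minimal sketch, not a tested procedure. It assumes a 1.0.x on-disk layout in which the system keyspace's SSTable files sit directly in the system data directory and start with "HintsColumnFamily-". The function name find_hint_sstables and the path /var/lib/cassandra/data/system are assumptions for illustration; check data_file_directories in cassandra.yaml for the real location. It only lists files and deletes nothing.

```python
from pathlib import Path


def find_hint_sstables(system_data_dir):
    """Return HintsColumnFamily SSTable component files, largest first.

    Nothing is deleted here; this only lists candidates for review.
    """
    directory = Path(system_data_dir)
    # Assumption: hint SSTable components are named "HintsColumnFamily-*".
    files = [p for p in directory.glob("HintsColumnFamily-*") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


if __name__ == "__main__":
    # Assumed location of the system keyspace; adjust to match the
    # data_file_directories setting in cassandra.yaml.
    for path in find_hint_sstables("/var/lib/cassandra/data/system"):
        print(f"{path.stat().st_size:>15,d}  {path.name}")
```

Run it with the node stopped, and only after repair has completed on all nodes, so the listing reflects exactly what would be removed.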

 

Cheers

 

-----------------

Aaron Morton

Freelance Developer

@aaronmorton

 

On 16/03/2012, at 12:58 PM, Bryce Godfrey wrote:



We were having some occasional memory pressure issues, but we added more RAM to the nodes a few days ago and things are running more smoothly now. In general, though, nodes have not been going up and down.

 

I tried to do a “list HintsColumnFamily” from cassandra-cli, but it locked up my Cassandra node and never returned, forcing me to kill the Cassandra process and restart it to get the node back.

 

Here are my settings, which I believe are the defaults since I don’t remember changing them:

 

hinted_handoff_enabled: true

max_hint_window_in_ms: 3600000 # one hour

hinted_handoff_throttle_delay_in_ms: 50

 

Grepping for "Hinted" in the system log, I get these:

INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3

INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting

INFO [ScheduledTasks:1] 2012-03-13 16:15:32,569 StatusLogger.java (line 65) HintedHandoff                     1         1         0

INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3

INFO [HintedHandoff:1] 2012-03-13 16:21:37,266 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3

INFO [ScheduledTasks:1] 2012-03-13 16:23:07,662 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-13 16:25:49,330 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-13 16:30:52,503 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-13 16:42:22,202 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries

INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 ColumnFamilyStore.java (line 704) Enqueuing flush of Memtable-HintsColumnFamily@661547256(34298224/74465815 serialized/live bytes, 78808 ops)

INFO [HintedHandoff:1] 2012-03-13 17:11:00,098 HintedHandOffManager.java (line 373) Finished hinted handoff of 44160 rows to endpoint /192.168.20.3

INFO [HintedHandoff:1] 2012-03-13 17:11:36,596 HintedHandOffManager.java (line 296) Started hinted handoff for token: 56713727820156407428984779325531226112 with IP: /192.168.20.4

INFO [ScheduledTasks:1] 2012-03-13 17:12:25,248 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-13 18:47:56,151 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3

INFO [ScheduledTasks:1] 2012-03-13 18:50:24,326 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:12:48,177 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:13:57,685 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:14:57,258 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:14:58,260 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:15:59,093 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:16:59,428 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:18:01,862 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:18:01,898 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:19:04,527 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:19:04,541 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:20:07,712 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [ScheduledTasks:1] 2012-03-14 12:20:08,332 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-14 12:27:13,033 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3

INFO [ScheduledTasks:1] 2012-03-15 15:05:00,954 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-15 15:06:07,750 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries

INFO [ScheduledTasks:1] 2012-03-15 15:06:07,802 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-15 15:06:07,809 ColumnFamilyStore.java (line 704) Enqueuing flush of Memtable-HintsColumnFamily@254668880(103911/8312880 serialized/live bytes, 63877 ops)

INFO [ScheduledTasks:1] 2012-03-15 15:07:13,503 StatusLogger.java (line 65) HintedHandoff                     1         2         0

INFO [HintedHandoff:1] 2012-03-15 15:15:43,842 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
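
A log like this is easier to read once the events are tallied per endpoint. The following is a rough sketch that counts the four HintedHandOffManager message types seen above (started, finished, timed out, died) and sums the delivered rows. The regular expressions are derived only from the lines quoted in this thread, so other Cassandra versions may word these messages differently. The names PATTERNS, SAMPLE and summarize_handoff are made up for this example.

```python
import re
from collections import Counter, defaultdict

# Patterns copied from the HintedHandOffManager messages quoted above.
# In every pattern the endpoint address is the last capture group.
PATTERNS = {
    "started": re.compile(r"Started hinted handoff for token: \d+ with IP: /(\S+)"),
    "finished": re.compile(r"Finished hinted handoff of (\d+) rows to endpoint /(\S+)"),
    "timed_out": re.compile(r"Timed out replaying hints to /([^;\s]+)"),
    "died": re.compile(r"Endpoint /(\S+) died before hint delivery"),
}

# A few lines taken verbatim from the log above.
SAMPLE = """\
INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries
"""


def summarize_handoff(lines):
    """Count hinted-handoff events per endpoint and total delivered rows."""
    events = defaultdict(Counter)
    rows_delivered = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if not match:
                continue
            endpoint = match.group(match.lastindex)
            events[endpoint][name] += 1
            if name == "finished":
                rows_delivered[endpoint] += int(match.group(1))
            break
    return events, rows_delivered


if __name__ == "__main__":
    events, rows = summarize_handoff(SAMPLE.splitlines())
    for endpoint in sorted(events):
        print(endpoint, dict(events[endpoint]), "rows delivered:", rows[endpoint])
```

To run it over a real log, pass an open file handle instead of SAMPLE.splitlines(). An endpoint with many "started" events but few "finished" ones is the one to look at first.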

 

 

From: aaron morton [mailto:aaron@thelastpickle.com] 
Sent: Thursday, March 15, 2012 1:51 AM
To: user@cassandra.apache.org
Subject: Re: Large hints column family

 

Is there anything going on in the logs? Are nodes going up and down? Can you see any messages about delivering hints?

 

If the query to read the hints errors out, it will log "HintsCF getEPPendingHints timed out" at INFO level.

 

Also, just checking: do the hinted_handoff_* settings in cassandra.yaml have their default values?

 

Cheers

 

-----------------

Aaron Morton

Freelance Developer

@aaronmorton

 

On 15/03/2012, at 8:35 AM, Bryce Godfrey wrote:




Forgot to mention that this is on 1.0.8

 

From: Bryce Godfrey [mailto:Bryce.Godfrey@azaleos.com] 
Sent: Wednesday, March 14, 2012 12:34 PM
To: user@cassandra.apache.org
Subject: Large hints column family

 

The system HintsColumnFamily seems large in my cluster, and I want to track down why that is.  I tried invoking “listEndpointsPendingHints()” on o.a.c.db.HintedHandoffManager, but it never returns and also freezes the node that it’s invoked against.  It’s a 3-node cluster, and all nodes have been up and running without issue for a while.  Any help on where to start with this?

 

               Column Family: HintsColumnFamily

                SSTable count: 11

                Space used (live): 11271669539

                Space used (total): 11271669539

                Number of Keys (estimate): 1408

                Memtable Columns Count: 338

                Memtable Data Size: 0

                Memtable Switch Count: 1

                Read Count: 3

                Read Latency: 4354.669 ms.

                Write Count: 848

                Write Latency: 0.029 ms.

                Pending Tasks: 0

                Bloom Filter False Postives: 0

                Bloom Filter False Ratio: 0.00000

                Bloom Filter Space Used: 12656

                Key cache capacity: 14

                Key cache size: 11

                Key cache hit rate: 0.6666666666666666

                Row cache: disabled

                Compacted row minimum size: 105779

                Compacted row maximum size: 7152383774

                Compacted row mean size: 590818614
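
The byte counts above are hard to judge at a glance. This small sketch converts them to binary units and compares the largest compacted row with the live space; the values are copied from the cfstats output, and the helper name human_size is made up for this example. By this arithmetic the column family holds about 10.5 GiB, and the largest reported row is about 6.7 GiB. A single row that size may be part of why reading the hints never returns, though that is a guess rather than something the output proves.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, GiB)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


# Values copied from the cfstats output above.
CFSTATS = {
    "Space used (live)": 11271669539,
    "Compacted row maximum size": 7152383774,
    "Compacted row mean size": 590818614,
    "Compacted row minimum size": 105779,
}

# Ratio of the largest reported row to the total live space.
share = CFSTATS["Compacted row maximum size"] / CFSTATS["Space used (live)"]

if __name__ == "__main__":
    for name, value in CFSTATS.items():
        print(f"{name}: {human_size(value)}")
    print(f"Largest row as a share of live space: {share:.0%}")
```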

 

Thanks,

Bryce