I know how this sounds, but upgrading to 1.1.11 is the best approach. 
1.0.X is not getting any fixes, 1.1.X is the most stable and still getting some patches, and 1.2 is stable and in use. 

Hint storage has been redesigned in 1.2. 

Any suggestions on how to make the cluster more tolerant to downtimes?
Hints are always seen as an optimisation; whether they succeed or not does not affect the consistency guarantees. 

If you are dealing with very high throughput, as a workaround you can reduce the time that hints are stored for a down node; see the yaml file for info. 

The behaviour changes depending on whether you have lots of small columns or large ones. This is the code from HintedHandOffManager that selects the page size:

        int pageSize = PAGE_SIZE;
        // read less columns (mutations) per page if they are very large
        if (hintStore.getMeanColumns() > 0)
        {
            int averageColumnSize = (int) (hintStore.getMeanRowSize() / hintStore.getMeanColumns());
            pageSize = Math.min(PAGE_SIZE, DatabaseDescriptor.getInMemoryCompactionLimit() / averageColumnSize);
            pageSize = Math.max(2, pageSize); // page size of 1 does not allow actual paging b/c of >= behavior on startColumn
            logger_.debug("average hinted-row column size is {}; using pageSize of {}", averageColumnSize, pageSize);
        }

If you reduce the in_memory_compaction_limit yaml setting, that would reduce the page size.
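
To get a feel for the numbers, here is a rough standalone sketch of the same calculation. It is not the Cassandra code itself: the class and method names are made up for illustration, the PAGE_SIZE value of 1024 is an assumption (check HintedHandOffManager in your version), and the row size and hint count are invented inputs, not measurements from your cluster.

```java
public class HintPageSizeSketch {
    // Assumed value for illustration only; check PAGE_SIZE in your version.
    static final int PAGE_SIZE = 1024;

    // Mirrors the page-size logic quoted above, with plain numbers instead of
    // hintStore / DatabaseDescriptor lookups. All sizes are in bytes.
    static int pageSize(long meanRowSize, long meanColumns, long inMemoryCompactionLimit) {
        int pageSize = PAGE_SIZE;
        if (meanColumns > 0) {
            // Guard against a zero average (not in the quoted code) to avoid dividing by zero.
            long averageColumnSize = Math.max(1, meanRowSize / meanColumns);
            pageSize = (int) Math.min(PAGE_SIZE, inMemoryCompactionLimit / averageColumnSize);
            pageSize = Math.max(2, pageSize); // never page with fewer than 2 columns
        }
        return pageSize;
    }

    public static void main(String[] args) {
        long rowSize = 1536L * 1024 * 1024; // hypothetical 1.5 GB hint row
        long hints = 10_000;                // hypothetical number of hints in that row
        // Hypothetical limits of 64 MB and 16 MB
        System.out.println(pageSize(rowSize, hints, 64L * 1024 * 1024)); // prints 416
        System.out.println(pageSize(rowSize, hints, 16L * 1024 * 1024)); // prints 104
    }
}
```

With these made-up inputs the average hint is about 157 KB, so cutting the limit from 64 MB to 16 MB drops the page from 416 hints to 104, which is roughly a quarter of the data pulled into memory per read.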

Aaron Morton
Freelance Cassandra Consultant
New Zealand


On 21/05/2013, at 9:26 PM, Vladimir Volkov <vlad.volkov@gmail.com> wrote:


I'm stress-testing our Cassandra (version 1.0.9) cluster, and tried turning off two of the four nodes for half an hour under heavy load. As a result I got a large volume of hints on the live nodes: HintsColumnFamily takes about 1.5 GB of disk space on each of the nodes. It seems these hints are never replayed successfully.

After I bring other nodes back online, tpstats shows active handoffs, but I can't see any writes on the target nodes.
The log indicates memory pressure - the heap is >80% full (heap size is 8GB total, 1GB young).

A fragment of the log:
 INFO 18:34:05,513 Started hinted handoff for token: 1 with IP: /
 INFO 18:34:06,794 GC for ParNew: 300 ms for 1 collections, 5974181760 used; max is 8588951552
 INFO 18:34:07,795 GC for ParNew: 263 ms for 1 collections, 6226018744 used; max is 8588951552
 INFO 18:34:08,795 GC for ParNew: 256 ms for 1 collections, 6559918392 used; max is 8588951552
 INFO 18:34:09,796 GC for ParNew: 231 ms for 1 collections, 6846133712 used; max is 8588951552
 WARN 18:34:09,805 Heap is 0.7978131149667941 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.
 WARN 18:34:09,805 Flushing CFS(Keyspace='test', ColumnFamily='t2') to relieve memory pressure
 INFO 18:34:09,806 Enqueuing flush of Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:09,807 Writing Memtable-t2@639524673(60608588/571839171 serialized/live bytes, 743266 ops)
 INFO 18:34:11,018 GC for ParNew: 449 ms for 2 collections, 6573394480 used; max is 8588951552
 INFO 18:34:12,019 GC for ParNew: 265 ms for 1 collections, 6820930056 used; max is 8588951552
 INFO 18:34:13,112 GC for ParNew: 331 ms for 1 collections, 6900566728 used; max is 8588951552
 INFO 18:34:14,181 GC for ParNew: 269 ms for 1 collections, 7101358936 used; max is 8588951552
 INFO 18:34:14,691 Completed flushing /mnt/raid/cassandra/data/test/t2-hc-244-Data.db (56156246 bytes)
 INFO 18:34:15,381 GC for ParNew: 280 ms for 1 collections, 7268441248 used; max is 8588951552
 INFO 18:34:35,306 InetAddress / is now dead.
 INFO 18:34:35,306 GC for ConcurrentMarkSweep: 19223 ms for 1 collections, 3774714808 used; max is 8588951552
 INFO 18:34:35,309 InetAddress / is now UP

After taking off the load and restarting the service, I still see pending handoffs:
$ nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        1004257         0                 0
RequestResponseStage              0         0          92555         0                 0
MutationStage                     0         0              6         0                 0
ReadRepairStage                   0         0          57773         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0         143332         0                 0
AntiEntropyStage                  0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemtablePostFlusher               0         0              2         0                 0
StreamStage                       0         0              0         0                 0
FlushWriter                       0         0              2         0                 0
MiscStage                         0         0              0         0                 0
InternalResponseStage             0         0              0         0                 0
HintedHandoff                     1         3             15         0                 0

These 3 handoffs remain pending for a long time (>12 hours).
Most of the time Cassandra uses 100% of one CPU core, the stack trace of the busy thread is:
"HintedHandoff:1" daemon prio=10 tid=0x0000000001220800 nid=0x3843 runnable [0x00007fa1e1146000]
   java.lang.Thread.State: RUNNABLE
        at java.util.ArrayList$Itr.remove(ArrayList.java:808)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedSuper(ColumnFamilyStore.java:908)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeletedColumnsOnly(ColumnFamilyStore.java:857)
        at org.apache.cassandra.db.ColumnFamilyStore.removeDeleted(ColumnFamilyStore.java:850)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1195)
        at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1150)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:324)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:256)
        at org.apache.cassandra.db.HintedHandOffManager.access$300(HintedHandOffManager.java:84)
        at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:437)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)

Heap usage is also rather high, though the node isn't doing anything, except the HH processing. Here is CMS output:
2013-05-20T22:22:59.812+0400: 4672.075: [GC[YG occupancy: 70070 K (943744 K)]4672.075: [Rescan (parallel) , 0.0224060 secs]4672.098: [weak refs processing, 0.0002900 secs]4672.098: [scrub string table, 0.0002670 secs] [1 CMS-remark: 5523830K(7340032K)] 5593901K(8283776K), 0.0231160 secs] [Times: user=0.28 sys=0.00, real=0.02 secs]

Eventually, after a few service restarts, the hints suddenly disappear. Probably, the TTL expires and the hints get compacted away.

Currently my best guess is the following. Hinted handoffs are stored as supercolumns, with one row per target node. The service tries to read them entirely into memory for replay and fails, because the volume is too large to fit in the heap at once.
Then the TTL expires, and the service starts to delete old subcolumns during read. Since the underlying storage is a huge ArrayList, the deletion is inefficient and takes forever.

So, it seems there are two problems here.
1) Hints are not paged correctly and cause significant memory pressure - that's actually strange, since the same issue was supposedly addressed in https://issues.apache.org/jira/browse/CASSANDRA-1327 and https://issues.apache.org/jira/browse/CASSANDRA-3624;
2) Deletion of outdated hints doesn't work well for large hint volumes.

Any suggestions on how to make the cluster more tolerant to downtimes?

If I turn off the hinted handoff entirely, and manually run a repair after a downtime, will it restore all the data correctly?

Best regards, Vladimir