Mailing-List: contact user-help@cassandra.apache.org; run by ezmlm
Precedence: bulk
Reply-To: user@cassandra.apache.org
Received-SPF: pass (nike.apache.org: local policy)
From: Chris Baron <Chris.Baron@ip-soft.net>
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tue, 15 Feb 2011 23:25:32 -0500
Subject: Hinted Handoff/GC Tuning Headache
Thread-Topic: Hinted Handoff/GC Tuning Headache
Thread-Index: AQHLzZGMGXWysJ55N0i3PaW/xxwaYg==
Message-ID: 
 <43AFA1D4CADA484797B9F3D58C86DB4901ABAB5739@NY1-EXMB-01.ip-soft.net>
Accept-Language: en-US
Content-Language: en-US
acceptlanguage: en-US
Content-Type: text/plain; charset="Windows-1252"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Recently upgraded my 8-node cluster from 0.6.6 to 0.7.0 (and even more
recently to 0.7.1) for ExpiringColumn, among the many other spectacular
improvements.

Retuned the GC settings based on experience from 0.6.6 and the new
defaults.

After about a week, two of the nodes were very far behind on minor
compactions (2k+ SSTables per CF and growing, 20k+ pending compactions).
The SSTable switch rate on these two nodes was about 10x higher than on
the other nodes. I also observed rolling long-pause deaths (Gossip saying
node X is dead): seemingly every three minutes, one of the nodes would hit
a long GC pause. I saw this behavior when I upgraded from 0.6.6 to 0.6.8
as well, but I rolled back to 0.6.6 because there was no time for a deeper
look then. (Found this:
https://issues.apache.org/jira/browse/CASSANDRA-1656)

I eventually traced this behavior back to a nasty interaction between
Hinted Handoff and GC tuned for normal operating conditions.

If I understand the code correctly, when a node replays a hint it reads
the hinted data directly from the application tables (read: my
ColumnFamily). If the replaying node also happens to be a replica, it will
resend the entire row, even if only one column was mutated. Because of the
rolling GC pause deaths, the HHs rarely succeeded, and when they did it
wasn't long before a new set of hints was recorded.

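To put rough numbers on the replay behavior described above, here is a
back-of-the-envelope sketch in Python. It is not Cassandra code: the
function names and all of the figures are hypothetical, and it simply
assumes that each replayed hint sends every column of the row rather than
only the mutated one.

```python
def replay_amplification(columns_per_row, mutated_columns=1):
    """Ratio of columns sent on replay to columns actually mutated.

    Assumption (not verified against Cassandra): a hint replay resends
    the whole row, so the cost scales with row width.
    """
    if mutated_columns < 1 or columns_per_row < mutated_columns:
        raise ValueError("need 1 <= mutated_columns <= columns_per_row")
    return columns_per_row / mutated_columns


def columns_replayed(hints, columns_per_row):
    """Total columns landing in the target's memtables for a batch of hints."""
    return hints * columns_per_row


# Hypothetical workload: 1,000 hints against rows of 5,000 columns each,
# with a single column mutated per hint.
print(replay_amplification(5000))
print(columns_replayed(1000, 5000))
```

Under these made-up numbers, 1,000 single-column mutations turn into five
million columns on the target node, which is the kind of growth that
would show up in memtable size rather than in the number of writes.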
Disabling Hinted Handoffs has fixed this problem, for me.

Looking into the intermittent GC issues further, the verbose GC log showed
ParNew promotion failures, so I conservatively lowered
CMSInitiatingOccupancyFraction, MAX_NEWSIZE, and
in_memory_compaction_limit_in_mb. I'm now seeing long CMS times (8000ms+)
but no failures, which leads me to believe a 6G heap may be too large for
the current tuning.

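For anyone who wants to check their own verbose GC log for the same
symptom, a minimal Python sketch follows. It only counts lines containing
the two failure markers HotSpot prints with -XX:+PrintGCDetails; the
function name is hypothetical and the sample lines are made up to
resemble ParNew output, not taken from my logs.

```python
# Substrings HotSpot writes into the verbose GC log when a young-generation
# promotion fails or CMS cannot finish before the old generation fills up.
MARKERS = ("promotion failed", "concurrent mode failure")


def count_gc_failures(lines):
    """Count log lines that mention each failure marker."""
    counts = {marker: 0 for marker in MARKERS}
    for line in lines:
        for marker in MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts


# Illustrative input only: two made-up lines shaped like ParNew entries.
sample = [
    "[GC [ParNew (promotion failed): 471872K->471872K(471872K), 0.45 secs]",
    "[GC [ParNew: 471872K->52416K(471872K), 0.05 secs]",
]
print(count_gc_failures(sample))
```

To run it against a real log, pass an open file object instead of the
sample list, since the function accepts any iterable of lines.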
It's worth noting that I saw no increase in ColumnFamily WriteCount or
StorageProxy.WriteOperations; only ColumnFamily MemtableColumnsCount and
MemtableDataSize were increasing very rapidly on the target node while
HintedHandoffs were replaying.

--
Chris