cassandra-commits mailing list archives

From "Jonathan Ellis (JIRA)" <>
Subject [jira] [Resolved] (CASSANDRA-5367) Hints stuck on compaction
Date Wed, 20 Mar 2013 16:41:16 GMT


Jonathan Ellis resolved CASSANDRA-5367.

    Resolution: Fixed

If it's still making progress, but slowly because of the activity on the target, then there's
not much we can do; it's working as designed.

I guess you could submit a patch to try to detect when the recipient is slower than expected,
and abort so we can try to deliver other stuff instead, but it's a pretty small gain you're
shooting for.
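
To make the idea concrete, here is a minimal sketch of what such a slow-recipient check might
look like. It is not code from Cassandra or from HintedHandoffManager; the class name
SlowTargetDetector, the window length, and the minimum-hints threshold are all hypothetical.
The delivery loop would call recordAndCheck() after each hint is acknowledged and stop
delivering to that target when it returns true.

```java
/**
 * Hypothetical helper, not part of Cassandra: counts hints delivered to one
 * target per fixed time window and flags the target as "too slow" when a
 * completed window saw fewer deliveries than the configured minimum.
 */
final class SlowTargetDetector {
    private final long minHintsPerWindow; // assumed threshold, tune per cluster
    private final long windowNanos;       // assumed window length in nanoseconds
    private long windowStartNanos;
    private long deliveredInWindow;

    SlowTargetDetector(long minHintsPerWindow, long windowNanos, long nowNanos) {
        this.minHintsPerWindow = minHintsPerWindow;
        this.windowNanos = windowNanos;
        this.windowStartNanos = nowNanos;
    }

    /**
     * Record one delivered hint at time nowNanos.
     * Returns true if a window just closed with too few deliveries,
     * meaning the caller may want to abort and move on to another target.
     */
    boolean recordAndCheck(long nowNanos) {
        deliveredInWindow++;
        if (nowNanos - windowStartNanos < windowNanos) {
            return false; // window still open, no verdict yet
        }
        boolean tooSlow = deliveredInWindow < minHintsPerWindow;
        // Start a fresh window regardless of the verdict.
        windowStartNanos = nowNanos;
        deliveredInWindow = 0;
        return tooSlow;
    }
}
```

Passing the clock value in explicitly (for example from System.nanoTime()) keeps the check easy
to test. Note that this only reacts after a hint completes, so a delivery that blocks
indefinitely would still need a separate timeout.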

Wontfixing since if you're constantly generating hints for many hosts you're underprovisioned
and should fix that instead of making the worst case slightly less bad.

If you want to submit code for this, go ahead and reopen at that time.

> Hints stuck on compaction
> -------------------------
>                 Key: CASSANDRA-5367
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>    Affects Versions: 1.2.2
>         Environment: 80 Node cluster on 1.2.2 (problem has been around since before 1.0)
>            Reporter: Brooke Bryan
>         Attachments: thread.log
>
> When our cluster is handling hints, we will very often see hints get stuck on nodes if
> they are unable to communicate with another node.  The problem is not that the other node
> is down; the other node will be sat doing compactions, or running out of memory.  While
> that node is a problem, and needs to be fixed, all other nodes in the cluster will stick
> waiting to handle hints between that node and themselves.
>
> This causes a pretty major knock-on effect throughout the entire cluster, causing hints
> to back up.  We are seeing some nodes backed up with 14GB of hints, after 2 days of the
> hints being stuck.
>
> Also, during this "stuck" session, compactionstats will show a compaction on the system
> hints column family, and the completed bytes amount will not change.
>
> This is the only reason I have experienced for an entire cluster to get very bogged down,
> and it requires a lot of manual intervention to get everything back online.
>
> After putting a node into debug mode, I have narrowed down the issue to be within:
> startColumn =; (line ~361 HintedHandoffManager) and line 390
> based on the log output, and through pausing handoffs etc.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators