cassandra-commits mailing list archives

From "Jay Zhuang (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-13696) Digest mismatch Exception if hints file has UnknownColumnFamily
Date Wed, 19 Jul 2017 06:32:00 GMT


Jay Zhuang commented on CASSANDRA-13696:

Thanks [~jjirsa].
I did more investigation today. It seems more serious than I thought: even with no node
down, running "drop table" while there is write traffic triggers the problem.
Here are the steps to reproduce:
1. Create a 3-node cluster:
  {{$ ccm create test13696 -v 3.0.14 && ccm populate -n 3 && ccm start}}
2. Send some traffic with cassandra-stress (blogpost.yaml is only in trunk; if you use another
yaml file, set RF=3):
  {{$ tools/bin/cassandra-stress user profile=test/resources/blogpost.yaml cl=QUORUM truncate=never
ops\(insert=1\) duration=30m -rate threads=2 -mode native cql3 -node}}
3. While the traffic is running, drop the table:
  {{$ cqlsh -e "drop table  stresscql.blogposts"}}
*All 3 nodes go down because of "Digest mismatch Exception".*

The CRC calculation problem has been there for a long time, but only got exposed after CASSANDRA-13004
because of the MessagingService version bump. In the normal case when the versions are the
same, HintsDispatcher uses {{page.buffersIterator()}}
instead of {{page.hintsIterator()}}.
{{buffersIterator()}} doesn't need to decode hints, so it won't have the problem.
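
To illustrate why a decoding path can hit a checksum mismatch that a raw-buffer path cannot, here is a minimal toy sketch. It is not Cassandra's {{HintsReader}}: the record layout, the class {{HintCrcSketch}} and all method names are hypothetical, and the assumed failure mode (bytes of an unknown table being skipped without being fed into the CRC) is only one plausible way such a mismatch could arise.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;
import java.util.zip.CRC32;

// Toy model only. Assumed record layout (NOT the real hints file format):
// [int bodyLength][body = int tableId + payload bytes][int crc32(body)]
class HintCrcSketch {

    static ByteBuffer writeRecord(int tableId, String payload) {
        byte[] data = payload.getBytes(StandardCharsets.UTF_8);
        ByteBuffer body = ByteBuffer.allocate(4 + data.length);
        body.putInt(tableId).put(data).flip();
        CRC32 crc = new CRC32();
        crc.update(body.duplicate()); // checksum the whole body
        ByteBuffer record = ByteBuffer.allocate(4 + body.remaining() + 4);
        record.putInt(body.remaining()).put(body).putInt((int) crc.getValue()).flip();
        return record;
    }

    // Raw path: the body is checksummed as opaque bytes and never decoded.
    static boolean rawPathChecksumMatches(ByteBuffer record) {
        ByteBuffer in = record.duplicate();
        byte[] body = new byte[in.getInt()];
        in.get(body);
        CRC32 crc = new CRC32();
        crc.update(body, 0, body.length);
        return (int) crc.getValue() == in.getInt();
    }

    // Decoding path: the CRC is updated only with bytes that were decoded.
    // With checksumSkippedBytes == false, the payload of an unknown table is
    // skipped without updating the CRC (the assumed bug in this toy model).
    static boolean decodingPathChecksumMatches(ByteBuffer record,
                                               Set<Integer> knownTables,
                                               boolean checksumSkippedBytes) {
        ByteBuffer in = record.duplicate();
        int length = in.getInt();
        CRC32 crc = new CRC32();
        byte[] idBytes = new byte[4];
        in.get(idBytes);
        crc.update(idBytes, 0, idBytes.length);
        int tableId = ByteBuffer.wrap(idBytes).getInt();
        byte[] rest = new byte[length - 4];
        in.get(rest);
        if (knownTables.contains(tableId) || checksumSkippedBytes) {
            crc.update(rest, 0, rest.length);
        }
        return (int) crc.getValue() == in.getInt();
    }

    public static void main(String[] args) {
        Set<Integer> known = Collections.singleton(1); // table 2 was "dropped"
        ByteBuffer dropped = writeRecord(2, "hint for a dropped table");
        System.out.println("raw path matches: " + rawPathChecksumMatches(dropped));
        System.out.println("decoding path, skipped bytes not checksummed: "
                + decodingPathChecksumMatches(dropped, known, false));
        System.out.println("decoding path, skipped bytes checksummed: "
                + decodingPathChecksumMatches(dropped, known, true));
    }
}
```

In this sketch the raw path validates the record regardless of whether the table still exists, while the decoding path only validates it if the skipped bytes are still included in the checksum.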

I think the messagingVersion for the hints file should be updated
so that hints can be dispatched in the optimized way. I am not sure whether we need to check/bump other {{MessagingService.VERSION_30}}s
in the 3.0 branch.
cc [~ifesdjeen]

> Digest mismatch Exception if hints file has UnknownColumnFamily
> ---------------------------------------------------------------
>                 Key: CASSANDRA-13696
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Jay Zhuang
>            Assignee: Jay Zhuang
>            Priority: Critical
> {noformat}
> WARN  [HintsDispatcher:2] 2017-07-16 22:00:32,579 - Failed to read a hint for / a2b7daf1-a6a4-4dfc-89de-32d12d2d48b0 - table with id 3882bbb0-6a71-11e7-9bca-2759083e3964 is unknown in file a2b7daf1-a6a4-4dfc-89de-32d12d2d48b0-1500242103097-1.hints
> ERROR [HintsDispatcher:2] 2017-07-16 22:00:32,580 - Failed to dispatch hints file a2b7daf1-a6a4-4dfc-89de-32d12d2d48b0-1500242103097-1.hints: file is corrupted ({})
> Digest mismatch exception
>     at org.apache.cassandra.hints.HintsReader$HintsIterator.computeNext(
>     at org.apache.cassandra.hints.HintsReader$HintsIterator.computeNext(
>     at org.apache.cassandra.utils.AbstractIterator.hasNext(
>     at org.apache.cassandra.hints.HintsDispatcher.sendHints(
>     at org.apache.cassandra.hints.HintsDispatcher.sendHintsAndAwait(
>     at org.apache.cassandra.hints.HintsDispatcher.dispatch(
>     at org.apache.cassandra.hints.HintsDispatcher.dispatch( ~[main/:na]
>     at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.deliver(
>     at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(
>     at org.apache.cassandra.hints.HintsDispatchExecutor$DispatchHintsTask.dispatch(
>     at org.apache.cassandra.hints.HintsDispatchExecutor$
>     at java.util.concurrent.Executors$ [na:1.8.0_111]
>     at [na:1.8.0_111]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(
>     at java.util.concurrent.ThreadPoolExecutor$
>     at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(
>     at ~[na:1.8.0_111]
> Caused by: Digest mismatch exception
>     at org.apache.cassandra.hints.HintsReader$HintsIterator.computeNextInternal(
>     at org.apache.cassandra.hints.HintsReader$HintsIterator.computeNext(
>     ... 16 common frames omitted
> {noformat}
> It causes multiple Cassandra nodes to stop by default.
> Here are the steps to reproduce on a 3-node cluster with RF=3:
> 1. stop node1
> 2. send some data at QUORUM (or ONE); this generates hints files on node2/node3
> 3. drop the table
> 4. start node1
> node2/node3 will report "corrupted hints file" and stop. The impact is very bad for a large cluster: when it happens, almost all the nodes go down at the same time, and we have to remove all the hints files (which contain the dropped table) to bring the nodes back.
