cassandra-commits mailing list archives

From "Ariel Weisberg (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10477) java.lang.AssertionError in StorageProxy.submitHint
Date Fri, 04 Dec 2015 17:14:11 GMT


Ariel Weisberg commented on CASSANDRA-10477:

bq. We're kind of dodging the hint "overload" protection on the paxos path as we don't use
sendToHintedEndpoints (which in particular makes the comment on commitPaxosLocal misleading
as it suggests otherwise). I think the simplest solution is to move the overload test from
sendToHintedEndpoints to some checkOverloaded() method and call that in commitPaxos too.
Which aspect of hint "overload" protection is missing? I see it increments a counter, which
I thought was the signal upstream.

Looking at it further, is it because it doesn't throw {{OverloadedException}}? So a better
behavior would be to put the check and the exception in a helper method and use that in
commitPaxos(), so that it can now throw {{OverloadedException}}?
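
A minimal sketch of the refactoring discussed above, not the actual patch. It assumes a single shared in-flight hint counter and a fixed limit; the class name, the {{MAX_HINTS_IN_PROGRESS}} value, the {{String}} endpoint type and the local {{OverloadedException}} class are hypothetical stand-ins for illustration, and the real check in StorageProxy may look at more state than this.

```java
import java.util.concurrent.atomic.AtomicInteger;

class HintOverloadSketch {
    /** Hypothetical stand-in for Cassandra's OverloadedException. */
    static class OverloadedException extends RuntimeException {
        OverloadedException(String message) { super(message); }
    }

    // Assumed limit and counter; the real thresholds and bookkeeping are not shown in this thread.
    static final int MAX_HINTS_IN_PROGRESS = 2;
    static final AtomicInteger hintsInProgress = new AtomicInteger();

    /** Shared guard: throws when too many hints are in flight, otherwise does nothing. */
    static void checkOverloaded(String destination) {
        int inFlight = hintsInProgress.get();
        if (inFlight > MAX_HINTS_IN_PROGRESS)
            throw new OverloadedException("Too many in-flight hints (" + inFlight + ") for " + destination);
    }

    static void sendToHintedEndpoints(String destination) {
        checkOverloaded(destination);
        // ... the regular write path would send or hint here ...
    }

    static void commitPaxos(String destination) {
        checkOverloaded(destination); // the paxos path now shares the same guard
        // ... the paxos commit would send or hint here ...
    }

    public static void main(String[] args) {
        commitPaxos("replica-1"); // below the limit: passes the guard
        hintsInProgress.set(MAX_HINTS_IN_PROGRESS + 1);
        try {
            commitPaxos("replica-1");
        } catch (OverloadedException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the guard in one place, both paths reject work the same way once the counter passes the limit, which is also why any caller of commitPaxos() would start seeing the exception.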

I do wonder what unforeseen consequences making {{CAS}} capable of throwing {{OE}} will
have, since we haven't seen or tested that before. Where this gets interesting is that
the read path now throws {{OE}} where it didn't before, because apparently serial consistency
reads can end up calling {{beginAndRepairPaxos}}. I need to take a close look at how we test
this path to make sure it's going to behave well once exercised.

bq. In theory, we could still run into the problem of that ticket if OPTIMIZE_LOCAL_REQUESTS
is false. And in fact, I believe this option is unsafe since at least CASSANDRA-4753 as we
somewhat strongly assume writes to the localhost do not go through MessagingService. So I
would suggest ditching that option. Not only is it unsafe, but it's not used anywhere by the
code and it's hardcoded so you have to change the code and recompile to even use it (which
means I doubt anyone has even tried it in a long long time). And if we end up needing it in
the future, we'll have to figure out how to make it safe.
It's already removed from 2.2. Yeah I don't think anyone uses it.

bq. Why is the added assertion in WriteCallbackInfo on 3.0 not using !shouldHint like in
the 2.1 patch?
It's an oversight from merging.

> java.lang.AssertionError in StorageProxy.submitHint
> ---------------------------------------------------
>                 Key: CASSANDRA-10477
>                 URL:
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local Write-Read Paths
>         Environment: CentOS 6, Oracle JVM 1.8.45
>            Reporter: Severin Leonhardt
>            Assignee: Ariel Weisberg
>             Fix For: 2.1.x, 2.2.x, 3.0.x, 3.x
> A few days after updating from 2.0.15 to 2.1.9 we have the following log entry on 2 of 5 machines:
> {noformat}
> ERROR [EXPIRING-MAP-REAPER:1] 2015-10-07 17:01:08,041 - Exception in thread Thread[EXPIRING-MAP-REAPER:1,5,main]
> java.lang.AssertionError: /
>         at org.apache.cassandra.service.StorageProxy.submitHint(
>         at$5.apply(
>         at$5.apply(
>         at org.apache.cassandra.utils.ExpiringMap$ ~[apache-cassandra-2.1.9.jar:2.1.9]
>         at org.apache.cassandra.concurrent.DebuggableScheduledThreadPoolExecutor$
>         at java.util.concurrent.Executors$ [na:1.8.0_45]
>         at java.util.concurrent.FutureTask.runAndReset( [na:1.8.0_45]
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
>         at java.util.concurrent.ScheduledThreadPoolExecutor$
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(
>         at java.util.concurrent.ThreadPoolExecutor$
>         at [na:1.8.0_45]
> {noformat}
> is the broadcast address of the local machine.
> When this is logged the read request latency of the whole cluster becomes very bad, from
> 6 ms/op to more than 100 ms/op according to OpsCenter. Clients get a lot of timeouts. We need
> to restart the affected Cassandra node to get back normal read latencies. It seems write latency
> is not affected.
> Disabling hinted handoff using {{nodetool disablehandoff}} only prevents the assert from
> being logged. At some point the read latency becomes bad again. Restarting the node where
> hinted handoff was disabled results in the read latency being better again.

This message was sent by Atlassian JIRA
