cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-10485) Missing host ID on hinted handoff write
Date Tue, 27 Oct 2015 13:19:27 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-10485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14975122#comment-14975122 ]

Paulo Motta edited comment on CASSANDRA-10485 at 10/27/15 1:18 PM:
-------------------------------------------------------------------

It seems pending endpoints are removed from the {{TokenMetadata}} before the new pending ranges
are calculated asynchronously by {{PendingRangeCalculatorService}}. For example, when {{StorageService}}
receives a notification that a node was removed:
{code:title=StorageService.java|borderStyle=solid}
public void onRemove(InetAddress endpoint)
{
    tokenMetadata.removeEndpoint(endpoint);
    PendingRangeCalculatorService.instance.update();
}
{code}

So there is a window, before the new pending ranges are calculated, in which removed pending
endpoints are still returned to write operations by {{TokenMetadata.pendingEndpointsFor()}}.
Since the endpoint has already been removed from the {{TokenMetadata}}, its host ID can no longer
be fetched, which triggers the "Missing host ID" assertion when the hint is written.
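
The window described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: {{RaceWindowSketch}} and its methods are hypothetical stand-ins, endpoints and ranges are plain strings, and the asynchronous recalculation is replaced by an explicit method call. It is not the real {{TokenMetadata}} code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical, heavily simplified stand-in for TokenMetadata (not the real class).
class RaceWindowSketch
{
    // endpoint -> host ID, which the hint write path needs to look up
    private final Map<String, UUID> hostIds = new HashMap<>();
    // range -> pending endpoints, recalculated asynchronously in the real system
    private final Map<String, Set<String>> pendingRanges = new HashMap<>();

    void addPendingEndpoint(String range, String endpoint, UUID hostId)
    {
        hostIds.put(endpoint, hostId);
        pendingRanges.computeIfAbsent(range, r -> new HashSet<>()).add(endpoint);
    }

    // Same ordering as onRemove(): the host ID mapping disappears immediately,
    // while the pending ranges are left untouched until the recalculation runs.
    void removeEndpoint(String endpoint)
    {
        hostIds.remove(endpoint);
    }

    // Stands in for the asynchronous recalculation: drops endpoints that are no longer known.
    void recalculatePendingRanges()
    {
        for (Set<String> endpoints : pendingRanges.values())
            endpoints.retainAll(hostIds.keySet());
    }

    Set<String> pendingEndpointsFor(String range)
    {
        return pendingRanges.getOrDefault(range, Collections.emptySet());
    }

    UUID getHostId(String endpoint)
    {
        return hostIds.get(endpoint);
    }

    public static void main(String[] args)
    {
        RaceWindowSketch metadata = new RaceWindowSketch();
        metadata.addPendingEndpoint("(0,100]", "10.0.0.1", UUID.randomUUID());

        metadata.removeEndpoint("10.0.0.1");
        // Inside the window: the endpoint is still pending, but its host ID is gone.
        System.out.println("pending: " + metadata.pendingEndpointsFor("(0,100]"));
        System.out.println("host ID: " + metadata.getHostId("10.0.0.1"));

        metadata.recalculatePendingRanges();
        // After the recalculation the stale endpoint is no longer returned.
        System.out.println("pending after recalculation: " + metadata.pendingEndpointsFor("(0,100]"));
    }
}
```

In this model, a write that arrives between {{removeEndpoint()}} and {{recalculatePendingRanges()}} gets back an endpoint whose host ID lookup returns null, which corresponds to the failing assertion.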

This seems to be confirmed by reports of this bug during node replacements or failed bootstraps
on CASSANDRA-6335 and CASSANDRA-10233. I also created a [simple test|https://github.com/pauloricardomg/cassandra/blob/9a4bb94f4a92ed20dba4b0c04e173c641d45251a/test/unit/org/apache/cassandra/locator/SimpleStrategyTest.java#L168]
confirming the issue.

The simple solution is to iterate over all pending ranges in {{TokenMetadata.removeEndpoint()}}
and remove the entries containing the removed endpoint. I assumed it's thread-safe to update
the {{HashMultimap}} backing {{pendingRanges}} within the {{TokenMetadata}} write lock while
it's read by other threads, based on this note in the {{HashMultimap}} [documentation|https://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/HashMultimap.html]:
bq. This class is not threadsafe when any concurrent operations update the multimap. Concurrent
read operations will work correctly. 
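
A minimal sketch of this approach, assuming a simplified model rather than the actual patch: {{PendingRangeCleanupSketch}} is a hypothetical class, a plain {{Map}} of sets replaces the Guava multimap so the example needs only the JDK, and endpoints and ranges are plain strings. The read path here takes the read lock and returns a copy, which is a conservative choice made for this sketch and not necessarily what the linked branches do.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified model of the proposed fix (not the real TokenMetadata).
class PendingRangeCleanupSketch
{
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, UUID> hostIds = new HashMap<>();
    private final Map<String, Set<String>> pendingRanges = new HashMap<>();

    void addPendingEndpoint(String range, String endpoint, UUID hostId)
    {
        lock.writeLock().lock();
        try
        {
            hostIds.put(endpoint, hostId);
            pendingRanges.computeIfAbsent(range, r -> new HashSet<>()).add(endpoint);
        }
        finally
        {
            lock.writeLock().unlock();
        }
    }

    // Removes the host ID and, under the same write lock, every pending range entry
    // that still refers to the removed endpoint.
    void removeEndpoint(String endpoint)
    {
        lock.writeLock().lock();
        try
        {
            hostIds.remove(endpoint);
            Iterator<Map.Entry<String, Set<String>>> it = pendingRanges.entrySet().iterator();
            while (it.hasNext())
            {
                Set<String> endpoints = it.next().getValue();
                endpoints.remove(endpoint);
                if (endpoints.isEmpty())
                    it.remove(); // drop ranges that no longer have pending endpoints
            }
        }
        finally
        {
            lock.writeLock().unlock();
        }
    }

    // Returns a copy so callers never observe a set that is being modified.
    Set<String> pendingEndpointsFor(String range)
    {
        lock.readLock().lock();
        try
        {
            return new HashSet<>(pendingRanges.getOrDefault(range, Collections.emptySet()));
        }
        finally
        {
            lock.readLock().unlock();
        }
    }

    UUID getHostId(String endpoint)
    {
        lock.readLock().lock();
        try
        {
            return hostIds.get(endpoint);
        }
        finally
        {
            lock.readLock().unlock();
        }
    }
}
```

With this ordering, a caller can no longer be handed a pending endpoint whose host ID has already been dropped, because both changes become visible together once the write lock is released.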

Code and tests below:

||2.1||2.2||3.0||trunk||
|[branch|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485]|[branch|https://github.com/apache/cassandra/compare/cassandra-2.2...pauloricardomg:2.2-10485]|[branch|https://github.com/apache/cassandra/compare/cassandra-3.0...pauloricardomg:3.0-10485]|[branch|https://github.com/apache/cassandra/compare/trunk...pauloricardomg:trunk-10485]|
|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-testall/lastCompletedBuild/testReport/]|[testall|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-testall/lastCompletedBuild/testReport/]|
|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.1-10485-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-2.2-10485-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-3.0-10485-dtest/lastCompletedBuild/testReport/]|[dtests|http://cassci.datastax.com/view/Dev/view/paulomotta/job/pauloricardomg-trunk-10485-dtest/lastCompletedBuild/testReport/]|
{color:red}
*test still running
{color}


For the record: while investigating the issue I created an [alternative solution|https://github.com/apache/cassandra/compare/cassandra-2.1...pauloricardomg:2.1-10485-v3],
where endpoint removal is performed only after the new pending ranges are calculated. I don't
think it's an ideal solution, because it would mean hints are stored for a node that is
already known to be removed. Furthermore, it may also break other places relying on the
assumption that the endpoint is removed from {{TokenMetadata}} immediately after the {{Gossiper}}
removal notification.


was (Author: pauloricardomg):
It seems pending endpoints are removed from the {{TokenMetadata}} before the new pending ranges
are calculated by {{StorageService}}:
{code:title=StorageService.java|borderStyle=solid}
public void onRemove(InetAddress endpoint)
{
    tokenMetadata.removeEndpoint(endpoint);
    PendingRangeCalculatorService.instance.update();
}
{code}

So, there's a window where nodes can be 

> Missing host ID on hinted handoff write
> ---------------------------------------
>
>                 Key: CASSANDRA-10485
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10485
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>
> when I restart one of them I receive the error "Missing host ID":
> {noformat}
> WARN  [SharedPool-Worker-1] 2015-10-08 13:15:33,882 AbstractTracingAwareExecutorService.java:169 - Uncaught exception on thread Thread[SharedPool-Worker-1,5,main]: {}
> java.lang.AssertionError: Missing host ID for 63.251.156.141
>         at org.apache.cassandra.service.StorageProxy.writeHintForMutation(StorageProxy.java:978) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$6.runMayThrow(StorageProxy.java:950) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.service.StorageProxy$HintRunnable.run(StorageProxy.java:2235) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_60]
>         at org.apache.cassandra.concurrent.AbstractTracingAwareExecutorService$FutureTask.run(AbstractTracingAwareExecutorService.java:164) ~[apache-cassandra-2.1.3.jar:2.1.3]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:105) [apache-cassandra-2.1.3.jar:2.1.3]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]
> {noformat}
> If I run nodetool status, the problematic node has an ID:
> {noformat}
> UN  10.10.10.12  1.3 TB     1       ?       4d5c8fd2-a909-4f09-a23c-4cd6040f338a  rack3
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
