cassandra-commits mailing list archives

From "Jeff Jirsa (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (CASSANDRA-13196) test failure in snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
Date Sat, 01 Apr 2017 23:37:41 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952462#comment-15952462 ]

Jeff Jirsa edited comment on CASSANDRA-13196 at 4/1/17 11:36 PM:
-----------------------------------------------------------------

Wouldn't be surprised if there was a race condition there. A not-dissimilar race was solved
recently in CASSANDRA-12653, where the race was in setting up token metadata as the node came
out of shadow round, and this is fairly similar: we come out of shadow round at
{{2017-02-06 22:13:17,494}}, we submit the migration tasks at {{2017-02-06 22:13:20,622}},
immediately ({{2017-02-06 22:13:20,623}}) decide not to send them, and the FD finally sees the
nodes come up at {{2017-02-06 22:13:20,665}}. At the very least, I'm not sure why we'd even
try to submit the migration task knowing the instance was down; requeueing the schema pull
immediately on failure here definitely wouldn't have helped (we'd have failed to send it again,
as the instance was still down). I'm sort of wondering whether this test still fails with
CASSANDRA-12653 committed. It doesn't seem like the exact same issue, but maybe the changes
from 12653 also help with this race?

I'm not sure of the history here, but it seems like [MigrationManager#shouldPullSchemaFrom|https://github.com/Gerrrr/cassandra/blob/463f3fecd9348ea0a4ce6eeeb30141527b8b10eb/src/java/org/apache/cassandra/schema/MigrationManager.java#L125]
could potentially check the endpoint's UP/DOWN state in addition to its messaging version.
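For illustration, the suggestion amounts to adding a liveness check alongside the existing messaging-version check. A toy, self-contained sketch of that guard follows; {{SchemaPullGuard}} and its maps are hypothetical stand-ins for {{MessagingService}} and {{FailureDetector}} state, not Cassandra's actual code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the proposed guard. The class name and the two maps are
// illustrative stand-ins; Cassandra's real shouldPullSchemaFrom lives in
// MigrationManager and consults MessagingService / the failure detector.
class SchemaPullGuard {
    static final int CURRENT_VERSION = 10;
    final Map<String, Integer> messagingVersion = new HashMap<>(); // endpoint -> known messaging version
    final Map<String, Boolean> alive = new HashMap<>();            // endpoint -> failure-detector UP/DOWN

    boolean shouldPullSchemaFrom(String endpoint) {
        Integer v = messagingVersion.get(endpoint);
        // existing check: we must know the endpoint's messaging version and it must match ours
        if (v == null || v != CURRENT_VERSION)
            return false;
        // proposed extra check: don't queue a migration task for a node the FD considers DOWN
        return Boolean.TRUE.equals(alive.get(endpoint));
    }
}
```

With a guard shaped like this, the migration task for the still-DOWN instance in the timeline above would never be submitted in the first place, rather than submitted and then dropped.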



was (Author: jjirsa):
Wouldn't be surprised if there was a race condition there - there was a not-dissimilar race
solved recently in CASSANDRA-12653 where the race was in setting up token metadata as the
node came out of shadow round, and this is fairly similar - we come out of shadow round at
{{2017-02-06 22:13:17,494}} , we submit the migration tasks at {{2017-02-06 22:13:20,622}}
and immediately {{2017-02-06 22:13:20,623}} decide not to send it, and FD finally sees the
nodes come up at {{2017-02-06 22:13:20,665}} - at the very least, I'm not sure why we'd even
try to submit the migration task knowing the instance was down - requeueing the schema pull
immediately on failure here definitely wouldn't have helped (we'd have failed to send it again,
as the instance was still down).

I'm not sure of the history here, but it seems like [MigrationManager#shouldPullSchemaFrom|https://github.com/Gerrrr/cassandra/blob/463f3fecd9348ea0a4ce6eeeb30141527b8b10eb/src/java/org/apache/cassandra/schema/MigrationManager.java#L125]
could potentially check that endpoint's UP/DOWN in addition to messaging version.  


> test failure in snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13196
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Shuler
>            Assignee: Aleksandr Sorokoumov
>              Labels: dtest, test-failure
>         Attachments: node1_debug.log, node1_gc.log, node1.log, node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
>     'num_tokens': '32',
>     'phi_convict_threshold': 5,
>     'range_request_timeout_in_ms': 10000,
>     'read_request_timeout_in_ms': 10000,
>     'request_timeout_in_ms': 10000,
>     'truncate_request_timeout_in_ms': 10000,
>     'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"\n-------------------- >> begin captured logging << --------------------\ndtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{   \'initial_token\': None,\n    \'num_tokens\': \'32\',\n    \'phi_convict_threshold\': 5,\n    \'range_request_timeout_in_ms\': 10000,\n    \'read_request_timeout_in_ms\': 10000,\n    \'request_timeout_in_ms\': 10000,\n    \'truncate_request_timeout_in_ms\': 10000,\n    \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured logging << ---------------------'
> {novnode}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
