cassandra-commits mailing list archives

From "Paulo Motta (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-10171) Windows dtest 3.0:
Date Wed, 26 Aug 2015 16:15:45 GMT


Paulo Motta commented on CASSANDRA-10171:

{{simple_repair_test}} and {{interrupt_build_process_test}} seem to have been fixed on [#29|]
after [e6a9afbb8a759fefc83334e470f5b8965f12a467|].
Since these tests do not need hint functionality, I [disabled|]
hinted handoff for those tests, similar to what is done in other tests of the same class.

Both {{complex_repair_test}} and {{really_complex_repair_test}} were flaky on [Linux|]
and consistently failing on [Windows|]
due to a timing problem, explained in more detail below. These tests had the following setup:
* 3 nodes
* RF=3
* node2 and node3 were stopped and the base table of the MV was updated on node1
* since materialized views require batch writes, which in turn require at least one additional live
node to store batchlogs, node4 was created in dc2 with rf=0 to fulfill that requirement

However, batchlog endpoints [must be in the same datacenter|],
otherwise the batchlog request cannot succeed. So why were the tests passing, since the only
other live node (node4) was in another datacenter?

Well, there is a [60-second|]
window before the topology file is reloaded, during which node4 was considered to be in the default
datacenter (dc1), so inserts would succeed and the tests passed on fast enough nodes.
However, on slower nodes (such as slower Linux nodes or win32 nodes), the topology file would
be reloaded after 60s and node4 would be considered part of dc2, so the batchlog write would fail with:

{noformat}code=1000 [Unavailable exception] message="Cannot achieve consistency level ONE"
info={'required_replicas': 1, 'alive_replicas': 0, 'consistency': 'ONE'}{noformat}
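
The race can be illustrated with a small toy model. This is only a sketch and not Cassandra's actual endpoint-selection code: the function names, the {{DEFAULT_DC}} constant and the dictionary-based topology are all assumptions made for illustration.

```python
# Toy model of the race described above; not Cassandra's actual code.
DEFAULT_DC = "dc1"  # assumed default DC for nodes missing from the loaded topology


def datacenter_of(node, topology):
    """Return the DC a node is believed to be in; unknown nodes get the default."""
    return topology.get(node, DEFAULT_DC)


def batchlog_candidates(coordinator, live_nodes, topology):
    """Live nodes, other than the coordinator, in the coordinator's datacenter."""
    local_dc = datacenter_of(coordinator, topology)
    return [
        n for n in live_nodes
        if n != coordinator and datacenter_of(n, topology) == local_dc
    ]


live = ["node1", "node4"]  # node2 and node3 are stopped

# Before the reload: node4 is not yet in the loaded topology, so it looks like dc1.
stale_topology = {"node1": "dc1"}
before_reload = batchlog_candidates("node1", live, stale_topology)

# After the reload: node4 is correctly placed in dc2.
fresh_topology = {"node1": "dc1", "node4": "dc2"}
after_reload = batchlog_candidates("node1", live, fresh_topology)

print(before_reload)  # ['node4']
print(after_reload)   # []
```

With an empty candidate list the batchlog write has nowhere to go, which corresponds to the {{alive_replicas: 0}} in the error above.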

In addition, {{complex_repair_test}} was passing even after the {{repair()}} statements
were removed, because the {{ALL}} consistency level was being used, which always retrieves the
most recent updates regardless of whether all nodes are consistent.
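
A simplified model shows why a read at {{ALL}} cannot detect a missing repair. The replica contents, timestamps and the {{read}} helper below are hypothetical, and real replica selection and read repair are more involved than this sketch assumes.

```python
# Toy model: each replica stores (value, write_timestamp) for a single row.
replicas = {
    "node1": ("new", 2),  # received the latest update
    "node2": ("old", 1),  # was down during the update
    "node3": ("old", 1),  # was down during the update
}


def read(contacted):
    """Return the value with the highest timestamp among the contacted replicas."""
    return max((replicas[n] for n in contacted), key=lambda cell: cell[1])[0]


# CL ALL contacts every replica, so the newest value always wins,
# whether or not repair has run.
read_all = read(["node1", "node2", "node3"])

# A read that only reaches the stale replicas exposes the inconsistency.
read_stale_subset = read(["node2", "node3"])

print(read_all)           # new
print(read_stale_subset)  # old
```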

To address these issues I refactored both {{complex_repair_test}} and {{really_complex_repair_test}}
while maintaining the essence of the tests. The most significant changes were:
* Used 5 nodes and RF=5, to have a quorum of 3 and a subquorum of 2. This made it possible to reach
the minimum of 2 replicas for batchlogs while maintaining 2 separate partitions to test.
* Set the gc_grace_seconds of the base table to 1 second (it's not possible to set it to
zero), to guarantee batchlogs would expire and there would be a mismatch between partitions
before repair.
* Used CL {{QUORUM}} instead of {{ALL}} to verify inconsistencies.
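
The node counts in the first item follow from the usual quorum formula. The sketch below only checks that arithmetic; the node names and the way they are split into two groups are assumptions for illustration, not taken from the refactored tests.

```python
def quorum(rf):
    """Smallest majority of rf replicas."""
    return rf // 2 + 1


rf = 5
nodes = ["node%d" % i for i in range(1, rf + 1)]

majority_size = quorum(rf)          # 3
minority_size = rf - majority_size  # 2

# Hypothetical split into the two partitions exercised by the tests.
majority = nodes[:majority_size]
minority = nodes[majority_size:]

print(majority_size, minority_size)  # 3 2
print(majority)                      # ['node1', 'node2', 'node3']
print(minority)                      # ['node4', 'node5']
```

With RF=3, as in the original tests, the same formula gives a quorum of 2 and leaves a single remaining node, which is below the 2 replicas mentioned above for batchlogs.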

The refactoring is available for review on this [cassandra-dtest PR|].
Adding [~aboudreault] as reviewer.

> Windows dtest 3.0:
> -------------------------------------------------------------------
>                 Key: CASSANDRA-10171
>                 URL:
>             Project: Cassandra
>          Issue Type: Sub-task
>            Reporter: Paulo Motta
>            Assignee: Paulo Motta
>             Fix For: 3.0.x
> The following 3.0 dtests have been failing [consistently|] on Windows:
> * materialized_views_test.TestMaterializedViews.complex_repair_test
> * materialized_views_test.TestMaterializedViews.interrupt_build_process_test
> * materialized_views_test.TestMaterializedViews.really_complex_repair_test
> * materialized_views_test.TestMaterializedViews.simple_repair_test

This message was sent by Atlassian JIRA
