cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sam Tunnicliffe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CASSANDRA-8176) Intermittent NPE from RecoveryManagerTest RecoverPIT unit test
Date Wed, 29 Oct 2014 10:59:34 GMT

    [ https://issues.apache.org/jira/browse/CASSANDRA-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14188237#comment-14188237
] 

Sam Tunnicliffe commented on CASSANDRA-8176:
--------------------------------------------

I believe this is a problem which at the moment can only affect the tests. There's a pre-existing
race which the changes introduced by CASSANDRA-6904 make more obvious. 

The race is in each of the RecoveryManagerTest tests:

{noformat}
1 - [main] calls CommitLog.instance.resetUnsafe() -- clears CL active/avail segments
2 - [main] calls CommitLog.instance.recover() -- attempts to replay all unmanaged files in
the CL dir, then recycle them.
3 - [COMMIT-LOG-ALLOCATOR] detects that there are no available segments, so it:
    3a - creates a new segment
    3b - adds it to the list of available segments
{noformat}

The issue is when steps 2 and 3a/3b are interspersed. In that scenario, the new segment created
in 3a is replayed by the {{\[main\]}} thread and subsequently recycled whereby its file is
renamed. However, as well as being recycled it is added to the list of available segments
(3b). So when the next test in the suite runs, this segment is used when performing the initial
writes. The recycling has renamed the file but the segment id has not changed, so the checksums
written are calculated using the id. Now when this second test resets then tries to recover,
it discards the segment due to the checksum mismatches. Ultimately, that's what leads to the
NPE & the test failure as the mutations from that dropped segment are not replayed. This
race is highlighted by the changes in the commit [~mshuler] mentioned because now {{CL.recover()}}
scans the commit log dir twice and even though in this test the archive/restore calls are
no-ops, this widens the window for the race to occur.

So it's only the (ab)use of {{resetUnsafe()}} coupled with the way that {{recover()}} expects
the segments it replays to be unmanaged that causes this problem. The simplest fix right now
is to add another {{resetUnsafe()}} to the tests so that each starts with a clean CL, avoiding
the possibility of reusing a segment that has already been recycled.

This is something to keep in mind with regard to CASSANDRA-7232 if we start replaying commit
logs outside of a restart.

> Intermittent NPE from RecoveryManagerTest RecoverPIT unit test
> --------------------------------------------------------------
>
>                 Key: CASSANDRA-8176
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8176
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Tests
>            Reporter: Michael Shuler
>            Assignee: Sam Tunnicliffe
>             Fix For: 2.1.2
>
>         Attachments: RecoveryManagerTest_failure_system.log.gz
>
>
> {noformat}
>     [junit] Testsuite: org.apache.cassandra.db.RecoveryManagerTest
>     [junit] Tests run: 5, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 7.654 sec
>     [junit] 
>     [junit] ------------- Standard Output ---------------
>     [junit] WARN  16:40:38 No host ID found, created 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877
(Note: This should happen exactly once per node).
>     [junit] WARN  16:40:38 No host ID found, created 2cbd54a8-79a5-40e0-a8e6-c8bf2c575877
(Note: This should happen exactly once per node).
>     [junit] WARN  16:40:38 Encountered bad header at position 16 of commit log /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log,
with invalid CRC. The end of segment marker should be zero.
>     [junit] WARN  16:40:38 Encountered bad header at position 16 of commit log /home/mshuler/git/cassandra/build/test/cassandra/commitlog:0/CommitLog-4-1414082433807.log,
with invalid CRC. The end of segment marker should be zero.
>     [junit] ------------- ---------------- ---------------
>     [junit] Testcase: testRecoverPIT(org.apache.cassandra.db.RecoveryManagerTest):  
   Caused an ERROR
>     [junit] null
>     [junit] java.lang.NullPointerException
>     [junit]     at org.apache.cassandra.db.RecoveryManagerTest.testRecoverPIT(RecoveryManagerTest.java:129)
>     [junit] 
>     [junit] 
>     [junit] Test org.apache.cassandra.db.RecoveryManagerTest FAILED
> {noformat}
> Test fails roughly 20-25% of CI runs. Several 10x and 25x bisections for 2.1 {{git bisect
start cassandra-2.1 f03e505}} resulted in {noformat}first bad commit: [1394b128c65ef1ad59f765e9c9c5058cac04ca69]{noformat}
which is CASSANDRA-6904.
> That patch went to 2.0 and I still need to dig there to see if we're getting the same
error, but I've attached the unit test failure system.log from 2.1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message