accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Created] (ACCUMULO-3249) New replication status message created for file that was already replicated
Date Tue, 21 Oct 2014 20:40:33 GMT
Josh Elser created ACCUMULO-3249:

             Summary: New replication status message created for file that was already replicated
                 Key: ACCUMULO-3249
             Project: Accumulo
          Issue Type: Bug
          Components: replication
            Reporter: Josh Elser
            Assignee: Josh Elser
            Priority: Critical
             Fix For: 1.7.0

Noticed a failure in UnorderedWorkAssignerReplicationIT.dataWasReplicatedToThePeerWithoutDrain
where the test timed out because a file that we expected to be replicated never was.

Digging into it:

* File was queued for replication before the original tserver died
* New tserver picked up the file to be replicated before recovery fully completed
* Tserver completed replication to the peer before recovery fully completed (recovery for
metadata/replication succeeded, but not for all tables)
* Master cleaned up replication records because it saw that the tserver recorded that replication
was completed.
* When recovery finally completed, it wrote an empty closed marker back into the metadata
table (which is a precaution to make sure that we know when a WAL is no longer referenced).

As such, we had an entry for a file that we thought needed replication but was already replicated.
That's issue #1.
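
To make the sequence above concrete, here is a minimal in-memory sketch of how the ordering
could leave a stale entry behind. Everything in it is an assumption for illustration: the
Status class, the replay() method, the "wal-1" name, and the map standing in for the metadata
table are hypothetical and are not Accumulo APIs or the actual record format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the race described above; none of these types are Accumulo APIs. */
final class ReplicationRaceSketch {

    /** Simplified stand-in for a replication status record. */
    static final class Status {
        final boolean replicated;
        final boolean closed;

        Status(boolean replicated, boolean closed) {
            this.replicated = replicated;
            this.closed = closed;
        }

        @Override
        public String toString() {
            return "Status[replicated=" + replicated + ", closed=" + closed + "]";
        }
    }

    /** Replays the events in order and returns the resulting stand-in "metadata table". */
    static Map<String, Status> replay() {
        Map<String, Status> metadata = new LinkedHashMap<>();
        String wal = "wal-1"; // hypothetical file name

        // File is queued for replication before the original tserver dies.
        metadata.put(wal, new Status(false, false));

        // A new tserver replicates the file to the peer while recovery is still running.
        metadata.put(wal, new Status(true, false));

        // The master sees that replication completed and cleans up the record.
        if (metadata.get(wal).replicated) {
            metadata.remove(wal);
        }

        // Recovery finally completes and writes an empty closed marker.
        // The earlier record is gone, so this creates a fresh, unreplicated entry
        // instead of being combined with the completed one.
        metadata.merge(wal, new Status(false, true),
                (old, marker) -> new Status(old.replicated, true));

        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

In this model the table ends with a single closed-but-unreplicated entry for a file that the
peer already has, which is the situation described as issue #1. If the cleanup step ran after
the closed marker was written, the merge would keep the replicated flag instead.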

For some reason that isn't clear yet, this also caused the master to get into a state where it
believed we needed to replicate the WAL but couldn't assign the WAL for replication (I believe
the master thought it was already assigned for replication), and thus that file was stuck in a
"pending-replication" phase and didn't proceed. Eventually the test timed out and failed.

This message was sent by Atlassian JIRA
