accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <>
Subject [jira] [Commented] (ACCUMULO-3249) New replication status message created for file that was already replicated
Date Wed, 22 Oct 2014 15:40:36 GMT


Josh Elser commented on ACCUMULO-3249:

Perhaps the problem is simpler than I'm making it: I just noticed that the file shouldn't
have been replicated unless it was marked as "closed", which is presently set when either the
WAL was part of a log-recovery in which no mutations were recovered from the file for the
tablet, or no references to the WAL exist in the metadata table.

The former is an incorrect assertion. At the time log-recovery runs, a tablet could
have a reference to a WAL for a table, while other tablets for that table could also reference
that WAL. Say the single tablet we're recovering doesn't happen to have any mutations: we
would mark the WAL as closed because *this* tablet won't be using the WAL anymore. The
problem is that other tablets *could* reuse the WAL for themselves. This creates a situation
where we think a WAL is both closed and not closed.

Generally stated: a single Tablet cannot determine the still-in-use or finished (closed) state
for a WAL. The only way to correctly ascertain this information is from a complete view over
the metadata table (as currently done by the GC).
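
To make the distinction concrete, here is a minimal sketch in Python. It is not Accumulo code: the
metadata table is modeled as a plain dict, and the names `closed_per_tablet`, `closed_by_full_scan`,
the tablet ids, and the WAL name are all hypothetical, chosen only to illustrate the reasoning above.

```python
from typing import Dict, Set

# Hypothetical stand-in for the metadata table: tablet id -> WALs it references.
MetadataView = Dict[str, Set[str]]


def closed_per_tablet(recovered_mutations: int) -> bool:
    """Flawed rule: one tablet's recovery found no mutations, so call the WAL closed."""
    return recovered_mutations == 0


def closed_by_full_scan(wal: str, metadata: MetadataView) -> bool:
    """Safer rule: the WAL is closed only if no tablet anywhere still references it."""
    return all(wal not in wals for wals in metadata.values())


# Assumed scenario: two tablets of the same table reference the same WAL.
metadata: MetadataView = {
    "table1;a": {"wal-0001"},  # tablet being recovered; the WAL holds no mutations for it
    "table1;m": {"wal-0001"},  # another tablet that may still use the WAL
}

per_tablet = closed_per_tablet(recovered_mutations=0)  # single-tablet view
full_scan = closed_by_full_scan("wal-0001", metadata)  # complete view
```

In this scenario the per-tablet rule reports the WAL as closed while the full scan reports it as
still in use, which is the "both closed and not closed" contradiction described above.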

> New replication status message created for file that was already replicated
> ---------------------------------------------------------------------------
>                 Key: ACCUMULO-3249
>                 URL:
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Critical
>             Fix For: 1.7.0
> Noticed a failure in UnorderedWorkAssignerReplicationIT.dataWasReplicatedToThePeerWithoutDrain
> where the test timed out because a file that we expected to be replicated never was.
> Digging into it:
> * File was queued for replication before the original tserver died
> * New tserver picked up the file to be replicated before recovery fully completed
> * Tserver completed replication to the peer before recovery fully completed (recovery
>   for metadata/replication succeeded, but not for all tables)
> * Master cleaned up replication records because it saw that the tserver recorded that
>   replication was completed.
> * When recovery finally completed, it wrote an empty closed marker back into the metadata
>   table (which is a precaution to make sure that we know when a WAL is no longer referenced).
> As such, we had an entry for a file that we thought needed replication but was already
> replicated. That's issue #1.
> For some reason not yet understood, this also caused the master to get into a state where
> it believed we needed to replicate the WAL but couldn't assign the WAL for replication (I
> believe the master thought it was already assigned for replication), and thus that file was
> stuck in a "pending-replication" phase and didn't proceed. Eventually the test timed out and
> failed.

