accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Created] (ACCUMULO-3320) Replication reference prematurely closed and removed.
Date Sat, 08 Nov 2014 20:53:33 GMT
Josh Elser created ACCUMULO-3320:
------------------------------------

             Summary: Replication reference prematurely closed and removed.
                 Key: ACCUMULO-3320
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3320
             Project: Accumulo
          Issue Type: Bug
          Components: replication
            Reporter: Josh Elser
            Assignee: Josh Elser
             Fix For: 1.7.0


Saw the following situation in MultiInstanceReplicationIT across the GC, tserver, and master. The WAL
03b6dad5-56c9-4f35-8daf-a444f3252038 appeared to never get cleaned up:

{noformat}
2014-11-08 12:05:34,898 [tserver.Tablet] DEBUG: Logs to be destroyed: 2<< tserver:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
2014-11-08 12:05:34,904 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.47.246:52263 !0
0 entries in 0.03 secs, nbTimes = [25 25 25.00 1] 
2014-11-08 12:05:34,914 [tserver.Tablet] DEBUG: Recording that data has been ingested into
2<< using [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038,
file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2]
2014-11-08 12:05:34,914 [util.ReplicationTableUtil] DEBUG: Updating replication status for
2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038,
file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2]
using [begin: 0 end: 0 infiniteEnd: true closed: false]

2014-11-08 12:05:35,008 [replication.CloseWriteAheadLogReferences] INFO : Found 1 WALs referenced
in metadata in 100.3 ms
2014-11-08 12:05:35,127 [replication.CloseWriteAheadLogReferences] DEBUG: Closing unreferenced
WAL (~replfile:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
stat:2 [] 5 false) in metadata table
2014-11-08 12:05:35,221 [replication.CloseWriteAheadLogReferences] INFO : Closed 1 WAL replication
references in replication table in 206.2 ms

2014-11-08 12:05:36,505 [tserver.Tablet] DEBUG: Logs to be destroyed: !0<;~ ip-172-31-47-246:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038

2014-11-08 12:05:36,517 [replication.StatusMaker] DEBUG: Creating replication status record
for file:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
on table 2 with [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1415448333352].

2014-11-08 12:06:27,097 [util.ReplicationTableUtil] DEBUG: Updating replication status for
2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038]
using [begin: 0 end: 0 infiniteEnd: true closed: false]
{noformat}

This is problematic due to cross-process interaction. The GC happened to run just after the
tserver performed a compaction and removed the log reference from the tablet. Thus, for a brief
window, the WAL was not referenced by any tablet in the metadata table, and, as such, the GC
"closed" the WAL replication reference.
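
To make that window concrete, here is a minimal sketch of the check-then-act pattern described above. The class and method names (snapshotReferencedWals, markStatusClosed, etc.) are illustrative stand-ins, not the actual CloseWriteAheadLogReferences code: the GC takes a snapshot of which WALs tablets reference in metadata, then closes the replication status of anything missing from that snapshot, so a WAL whose last tablet reference is dropped only momentarily (e.g. right after the 12:05:34 compaction above) can be closed even though the tserver is still writing to it.

{code:java}
import java.util.Set;

// Hedged sketch of the race window; the method names below are illustrative
// stand-ins, not the actual CloseWriteAheadLogReferences code.
class CloseWalReferencesSketch {

  // Step 1: snapshot the WALs currently referenced by tablets in the metadata table.
  Set<String> snapshotReferencedWals() {
    // Illustrative: the problem WAL momentarily has no tablet references here.
    return Set.of();
  }

  // Step 2: close the replication status of every WAL absent from that snapshot.
  void closeUnreferencedWals(Set<String> walsWithReplicationEntries) {
    Set<String> referenced = snapshotReferencedWals();
    for (String wal : walsWithReplicationEntries) {
      if (!referenced.contains(wal)) {
        // Race: a tserver can re-reference this WAL (closed: false) after the
        // snapshot was taken, so "unreferenced" does not mean "done being written".
        markStatusClosed(wal);
      }
    }
  }

  void markStatusClosed(String wal) {
    System.out.println("closing replication status for " + wal);
  }

  public static void main(String[] args) {
    new CloseWalReferencesSketch()
        .closeUnreferencedWals(Set.of("tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038"));
  }
}
{code}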

The master saw that it was closed, cleaned up the reference, and started replication. The tserver
continued to use the WAL (as it is allowed to) and placed more updates into the metadata table.
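
A rough, hypothetical sketch of that master-side reaction (the field names mirror the Status values printed in the log, but the methods are illustrative stand-ins rather than the real master code): a closed status is taken to mean the tserver is finished with the file, so replication work begins, even though in this case the close came from the GC race rather than from the tserver.

{code:java}
// Hypothetical sketch of the master-side reaction; not the actual master code.
class MasterReactionSketch {

  record Status(long begin, long end, boolean infiniteEnd, boolean closed) {}

  void onStatusRecord(String file, Status status) {
    if (status.closed()) {
      // Interpreted as "the tserver will write no more data to this WAL",
      // which is exactly the assumption the GC race above violates.
      startReplication(file, status);
    }
  }

  void startReplication(String file, Status status) {
    System.out.println("replicating " + file + " (begin=" + status.begin() + ")");
  }
}
{code}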

The other problem is that when the master saw that replication of the file was complete,
it removed the references from the replication table (as expected). However, on the next
round of StatusMaker, it created a new record that had lost the fact that the file had
already been fully replicated.
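
The Status fields visible in the log lines above (begin, end, infiniteEnd, closed, createdTime) make this second problem easy to illustrate. Below is a minimal, self-contained sketch that uses a plain Java record in place of the real protobuf-backed Status; the "fully replicated" offsets are invented for illustration, while the recreated record uses the values from the 12:05:36 StatusMaker log line. Once the old record is removed, the recreated one starts from begin: 0 and carries no memory that the data was already shipped.

{code:java}
// Minimal stand-in for the replication Status fields seen in the log output;
// the real type is protobuf-generated and the 1000/1000 offsets are invented.
class StatusRecreationSketch {

  record Status(long begin, long end, boolean infiniteEnd, boolean closed, long createdTime) {}

  public static void main(String[] args) {
    // State after the file was fully replicated: begin has caught up to end,
    // so the master removed the replication-table references (record is gone).
    Status fullyReplicated = new Status(1000, 1000, false, true, 1415448333352L);

    // A later StatusMaker pass sees the lingering metadata entry and creates a
    // brand-new record, as in the 12:05:36 log line above.
    Status recreated = new Status(0, 0, true, true, 1415448333352L);

    // The new record has no link to the old one, so the file now looks like it
    // still needs replication even though the data was already shipped.
    System.out.println("before removal: " + fullyReplicated);
    System.out.println("recreated:      " + recreated);
  }
}
{code}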



