accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3320) Replication reference prematurely closed and removed.
Date Mon, 10 Nov 2014 17:43:34 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14205064#comment-14205064 ]

Josh Elser commented on ACCUMULO-3320:
--------------------------------------

Looks like I might have actually tried to stub this out a long time ago and never got it
working. A short-term fix is to ask the tservers what their active WALs are, and then take
that into account when closing WALs for replication. The downside of this approach is that
it involves communicating with every tserver (though doing it in the GC makes that a little
less painful).
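
Rough shape of what I have in mind; to be clear, the RPC and every name below are made up, not existing API:

{code:java}
import java.util.HashSet;
import java.util.Set;

// Sketch only: TServerRpc/activeWals() stand in for whatever RPC we would add.
public class SafeWalCloser {

  interface TServerRpc {
    Set<String> activeWals(); // hypothetical "which WALs do you have open?" call
  }

  void closeUnreferenced(Iterable<TServerRpc> liveTservers, Iterable<String> candidateWals) {
    Set<String> inUse = new HashSet<>();
    for (TServerRpc ts : liveTservers) {
      inUse.addAll(ts.activeWals()); // one RPC per live tserver
    }
    for (String wal : candidateWals) {
      if (!inUse.contains(wal)) {
        closeReplicationReference(wal); // no tserver claims it, so closing is safe
      }
    }
  }

  void closeReplicationReference(String wal) {
    // write closed: true to the WAL's replication entry in the metadata table
  }
}
{code}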

Alternatively, when the tserver closes a WAL, it could record that fact somewhere for the
GC to find later. This could also reduce the RPCs the GC currently makes to clean up WALs.
Obviously this would be a much bigger change and would merit its own issue.
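
Very roughly, something like the following; MarkerStore is hypothetical and could be rows in the metadata table, znodes, etc.:

{code:java}
// Sketch only: the tserver durably records the close, and the GC trusts
// that record instead of RPCing every tserver.
public class WalCloseMarkers {

  interface MarkerStore {
    void put(String walPath);
    Iterable<String> scan();
    void delete(String walPath);
  }

  // tserver side: invoked once the WAL is closed and no tablet references it
  void onWalClosed(MarkerStore store, String walPath) {
    store.put(walPath);
  }

  // GC side: only WALs with a close marker are eligible; no per-tserver RPC
  void gcPass(MarkerStore store) {
    for (String wal : store.scan()) {
      closeReplicationReference(wal);
      store.delete(wal);
    }
  }

  void closeReplicationReference(String wal) {
    // write closed: true to the WAL's replication entry
  }
}
{code}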

> Replication reference prematurely closed and removed.
> -----------------------------------------------------
>
>                 Key: ACCUMULO-3320
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3320
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>             Fix For: 1.7.0
>
>
> Saw the following situation in MultiInstanceReplicationIT across GC, tserver and master. 03b6dad5-56c9-4f35-8daf-a444f3252038 appeared to never get cleaned up:
> {noformat}
> 2014-11-08 12:05:34,898 [tserver.Tablet] DEBUG: Logs to be destroyed: 2<< tserver:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
> 2014-11-08 12:05:34,904 [tserver.TabletServer] DEBUG: ScanSess tid 172.31.47.246:52263 !0 0 entries in 0.03 secs, nbTimes = [25 25 25.00 1]
> 2014-11-08 12:05:34,914 [tserver.Tablet] DEBUG: Recording that data has been ingested into 2<< using [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038, file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2]
> 2014-11-08 12:05:34,914 [util.ReplicationTableUtil] DEBUG: Updating replication status for 2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038, file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/8866f067-ed63-46ec-9dd1-b8d2e8381af2] using [begin: 0 end: 0 infiniteEnd: true closed: false]
> 2014-11-08 12:05:35,008 [replication.CloseWriteAheadLogReferences] INFO : Found 1 WALs referenced in metadata in 100.3 ms
> 2014-11-08 12:05:35,127 [replication.CloseWriteAheadLogReferences] DEBUG: Closing unreferenced WAL (~replfile:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038 stat:2 [] 5 false) in metadata table
> 2014-11-08 12:05:35,221 [replication.CloseWriteAheadLogReferences] INFO : Closed 1 WAL replication references in replication table in 206.2 ms
> 2014-11-08 12:05:36,505 [tserver.Tablet] DEBUG: Logs to be destroyed: !0<;~ ip-172-31-47-246:37732/file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038
> 2014-11-08 12:05:36,517 [replication.StatusMaker] DEBUG: Creating replication status record for file:/.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038 on table 2 with [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1415448333352].
> 2014-11-08 12:06:27,097 [util.ReplicationTableUtil] DEBUG: Updating replication status for 2<< with [file:////.../accumulo/test/target/mini-tests/org.apache.accumulo.test.replication.MultiInstanceReplicationIT_dataWasReplicatedToThePeer/accumulo/wal/tserver+37732/03b6dad5-56c9-4f35-8daf-a444f3252038] using [begin: 0 end: 0 infiniteEnd: true closed: false]
> {noformat}
> This is problematic due to cross-process interaction. The GC happened to run just after the tserver performed a compaction and removed the log reference from the tablet. Thus, for a very brief window, the WAL was not referenced by any tablets in the metadata table, and, as such, the GC "closed" the WAL replication reference.
> The master saw that it was closed, cleaned up the reference, and started replication. Meanwhile, the tserver continued to use the WAL (as it does) and placed some more updates into the metadata table.
> The other problem is that when the master saw that replication of the file was complete, it removed the references from the replication table (as expected). However, on the next round of StatusMaker, a new record was created that had lost the fact that the old file had already been fully replicated.
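
Boiled down, the race in the description is a non-atomic check-then-act in the GC; this is an illustration with made-up helper names, not the actual CloseWriteAheadLogReferences code:

{code:java}
import java.util.Collections;
import java.util.Set;

public class RacyCloseIllustration {

  void gcPass() {
    // point-in-time snapshot of which WALs the metadata table references
    Set<String> referenced = scanMetadataForWalReferences();
    for (String wal : replicationWals()) {
      // A tserver can still have this WAL open and re-add a metadata
      // reference right after the snapshot, which is exactly what the
      // 12:05:34 -> 12:06:27 log lines above show.
      if (!referenced.contains(wal)) {
        markClosed(wal); // begin: 0 end: 0 infiniteEnd: true closed: true
      }
    }
  }

  // hypothetical helpers, stubbed so the sketch compiles
  Set<String> scanMetadataForWalReferences() { return Collections.emptySet(); }
  Iterable<String> replicationWals() { return Collections.emptyList(); }
  void markClosed(String wal) { /* write the closed status entry */ }
}
{code}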



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
