accumulo-notifications mailing list archives

From "Josh Elser (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3772) WAL prematurely deleted for root table
Date Mon, 04 May 2015 23:03:06 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527513#comment-14527513 ]

Josh Elser commented on ACCUMULO-3772:
--------------------------------------

Using JIRA as my pastebin once again. Looking at the global view of this WAL across all logs:

We made the WAL:

{noformat}
2015-05-04 08:09:58,433 [log.DfsLogger] DEBUG: Got new write-ahead log: os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:09:58,445 [log.TabletServerLogger] DEBUG: Created next WAL hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
{noformat}

We later had a write come in (maybe an assignment) which added it to the root and metadata tables' sets of WALs:

{noformat}
2015-05-04 08:11:50,902 [tserver.TabletServer] INFO : Writing log marker for level ROOT hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:11:50,903 [util.MetadataTableUtil] DEBUG: Adding log entry hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:34,905 [tserver.TabletServer] INFO : Writing log marker for level META hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:34,906 [util.MetadataTableUtil] DEBUG: Adding log entry hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
{noformat}
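
For context, what "Adding log entry" amounts to at the META level is a mutation on the tablet's metadata row under the {{log}} column family (for ROOT, the reference is written to ZooKeeper instead, as the {{/root_tablet/current_logs}} paths further down show). A rough sketch with illustrative names and a made-up value encoding, not the actual {{MetadataTableUtil}} code:

{code:java}
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.MutationsRejectedException;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

// Conceptual sketch only -- not the actual MetadataTableUtil code.
class WalRefSketch {
  void addLogEntry(BatchWriter metadataWriter, Text tabletRow, String server, String walPath)
      throws MutationsRejectedException {
    Mutation m = new Mutation(tabletRow);
    // "log" is the metadata column family for WAL references; the qualifier and
    // value format here are illustrative guesses, not the real encoding.
    m.put(new Text("log"), new Text(server + "/" + walPath),
        new Value((walPath + "|1").getBytes()));
    metadataWriter.addMutation(m);
  }
}
{code}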

And then we compacted the memory and nuked the log references (once the in-memory data is flushed to a file, the tablet no longer needs the WAL):

{noformat}
2015-05-04 08:15:35,850 [tablet.Tablet] DEBUG: Logs for memory compacted: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:15:35,851 [tablet.Tablet] DEBUG: Logs to be destroyed: +r<< os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,763 [tablet.Tablet] DEBUG: Logs for memory compacted: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:19:04,764 [tablet.Tablet] DEBUG: Logs to be destroyed: !0<;~ os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997/hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
{noformat}

Next thing, the cluster was restarted in secure mode and the WAL was deleted:

{noformat}
2015-05-04 08:30:15,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:30:15,223 [gc.GarbageCollectWriteAheadLogs] DEBUG: WAL not needed for replication hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 08:30:15,268 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing unused WAL for server os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000] log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
{noformat}

At this point, the tests actually started up. So, we started doing things to Accumulo (CleanTmpIT, to be specific). This test writes a fake {{*_tmp}} file to HDFS in a Tablet's directory for a table it created and wrote data to. The test then restarts the TabletServers and checks to make sure that the file is gone. The test passed, and then proceeded to go through the test teardown, which includes deleting the tables it created. The test hangs there.
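
For anyone not familiar with the test, the gist of CleanTmpIT is roughly the following (a minimal sketch against the Hadoop {{FileSystem}} API; {{restartTabletServers()}}, {{tabletDir}}, and the class framing are stand-ins, not the actual IT code):

{code:java}
import static org.junit.Assert.assertFalse;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the CleanTmpIT pattern described above; all names are illustrative.
abstract class CleanTmpSketch {
  abstract void restartTabletServers() throws Exception; // cluster-control hook

  void run(Path tabletDir) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path fakeTmp = new Path(tabletDir, "junk_tmp"); // matches the *_tmp pattern
    fs.createNewFile(fakeTmp);

    restartTabletServers();

    // On restart, the tserver is expected to clean up stray *_tmp files
    assertFalse(fs.exists(fakeTmp));
  }
}
{code}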

When recovery kicked in after the restart, we saw the WAL was still in use:

{noformat}
2015-05-04 09:00:18,135 [master.Master] DEBUG: logs for dead servers: {os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004]=[hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4, hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2, hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-40d7-888e-f4b7dde1f966]}
2015-05-04 09:00:18,157 [state.ZooStore] DEBUG: Wrote +rM'�h<os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997�hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2 to /accumulo/90ea5016-3bc6-479c-b84d-02bd26cf85a6/root_tablet/walogs/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 09:00:18,158 [state.ZooStore] DEBUG: Removing /root_tablet/current_logs/~wal+os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004]088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 09:00:18,298 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 09:00:18,332 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
2015-05-04 09:00:18,334 [recovery.RecoveryManager] INFO : Starting recovery of hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2 (in : 10s), tablet +r<< holds a reference
{noformat}

It appears that there was _some_ disconnect over whether or not this WAL was still in use.
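
To make the suspected race concrete, the WAL GC's decision looks roughly like the following, heavily simplified (every helper here is hypothetical, not the actual {{GarbageCollectWriteAheadLogs}} code). If the root tablet's reference, which lives in ZooKeeper, isn't visible to the GC at the moment it computes its candidate set, a WAL that recovery still needs gets deleted:

{code:java}
import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Heavily simplified sketch of the WAL GC decision; all helpers are hypothetical.
abstract class WalGcSketch {
  abstract Set<Path> listWalFiles(FileSystem fs) throws IOException;  // .../data/wal/**
  abstract Set<Path> walsReferencedInMetadata() throws IOException;   // meta + user tablets
  abstract Set<Path> walsReferencedInZooKeeper() throws IOException;  // root tablet
  abstract Set<Path> walsNeededForReplication() throws IOException;   // "Checking replication table"

  void collect(FileSystem fs) throws IOException {
    Set<Path> candidates = listWalFiles(fs);
    candidates.removeAll(walsReferencedInMetadata());
    // If this read misses the root tablet's in-ZooKeeper reference (the suspected
    // disconnect), a WAL that is still needed for recovery stays a candidate.
    candidates.removeAll(walsReferencedInZooKeeper());
    candidates.removeAll(walsNeededForReplication());
    for (Path wal : candidates) {
      fs.delete(wal, false); // "Removing unused WAL for server ..."
    }
  }
}
{code}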

> WAL prematurely deleted for root table
> --------------------------------------
>
>                 Key: ACCUMULO-3772
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3772
>             Project: Accumulo
>          Issue Type: Bug
>          Components: gc, tserver
>         Environment: SHA1 2436eec237dc23608e91cae56777a323e2f77295 (I believe). Running with Kerberos on. HDFS 2.7.1-SNAPSHOT
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 1.7.0
>
>
> I have a 5-node test system which I ran all of the ITs against.
> When I came back to the cluster later, I observed the following:
> * Multiple MinCs against {{accumulo.metadata}} that showed as running but weren't making progress
> * {{accumulo.root}} without an online tablet
> * Multiple queued FATE ops (PopulateMetadata blocked).
> Ultimately, I found that recovery was repeatedly failing (without any notification on the monitor, mind you).
> {noformat}
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [master.Master] DEBUG: Root Table location State: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-40d7-888e-f4b7dde1f966 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/fc0e8426-f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,171 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,174 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Note *088e3c39-f19e-4f43-bfc6-82f978b8ccb2*. Eventually we see that the file is gone:
> {noformat}
> 2015-05-04 21:45:15,647 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:666)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {noformat}
> The most likely candidate to direct hatred towards is the GC
> {noformat}
> [root@os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2 accumulo]# fgrep 088e3c39-f19e-4f43-bfc6-82f978b8ccb2 gc_os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal.debug.log
> 2015-05-04 08:30:15,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,223 [gc.GarbageCollectWriteAheadLogs] DEBUG: WAL not needed for replication hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,268 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing unused WAL for server os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000] log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Sure enough, the GC deleted the file. I'll be digging from here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
