Date: Wed, 6 May 2015 03:39:52 +0000 (UTC)
From: "Josh Elser (JIRA)"
Reply-To: jira@apache.org
To: notifications@accumulo.apache.org
Subject: [jira] [Commented] (ACCUMULO-3772) WAL prematurely deleted for root table

    [ https://issues.apache.org/jira/browse/ACCUMULO-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14529835#comment-14529835 ]

Josh Elser commented on ACCUMULO-3772:
--------------------------------------

Alright, I think I was making _way_ more out of this than necessary.

# The TServer which had the WAL for root went down.
# The Master (TabletGroupWatcher) notices this and calls {{ZooTabletStateStore#unassign}}.
# The WAL entry moves in ZK from {{current_logs}} to {{walogs}}.

Concurrently, the GC runs a cycle, starting by computing the logs currently unused (using what it sees as the current servers and the logs marked as unused):

# Read all WALs for the root table in ZK from {{walogs}}.
# Read all WALs from root and metadata (for metadata and user tables, respectively).
# Retain only the WALs from root and metadata that are either for servers we know aren't running, or marked as {{unused}} and not seen in ZK for the root table.
# And then more stuff that doesn't matter here.

The issue is that the GC could remove the logs only in use by the root table _before_ the master moves the log from {{current_logs}} to {{walogs}}:

* Server hosting root dies -- WAL is still in {{current_logs}}
* GC runs
* GC deletes WAL
* Master moves the reference from {{current_logs}} to {{walogs}}
* Mass sadness

If I had to guess, the reason we don't see this more often is that the WAL in use by root is usually also in use by other tables.

As a solution, I believe we can read from {{current_logs}} and then {{walogs}} in ZooKeeper, instead of only {{walogs}}. Because {{ZooTabletStateStore}} first puts a record in {{walogs}} and then removes it from {{current_logs}}, we can be assured that we'll always see all logs from ZK.
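To make that read order concrete, here's a rough sketch against the plain ZooKeeper client API. The znode paths, class name, and method names here are illustrative assumptions, not the actual Accumulo layout or GC code:

{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: paths and names below are hypothetical, not Accumulo's real ones.
public class RootTabletWalScan {

  private final ZooKeeper zk;
  private final String root; // e.g. "/accumulo/<instance-id>" -- assumed layout

  public RootTabletWalScan(ZooKeeper zk, String root) {
    this.zk = zk;
    this.root = root;
  }

  /**
   * Collect every WAL the root tablet might still need. We read
   * current_logs *before* walogs -- the reverse of the write order in
   * ZooTabletStateStore (add to walogs, then remove from current_logs) --
   * so a log being moved concurrently shows up in at least one read.
   */
  public Set<String> candidateLogs() throws KeeperException, InterruptedException {
    Set<String> logs = new HashSet<>();
    logs.addAll(children(root + "/root_tablet/current_logs"));
    logs.addAll(children(root + "/root_tablet/walogs"));
    return logs;
  }

  private List<String> children(String path) throws KeeperException, InterruptedException {
    // Assumes the node exists; a real implementation would handle NoNodeException.
    return zk.getChildren(path, false);
  }
}
{code}

The ordering is the whole point: if a log is missing from the {{current_logs}} read, its removal (and therefore the earlier insert into {{walogs}}) must already have happened, so the subsequent {{walogs}} read is guaranteed to see it.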
> WAL prematurely deleted for root table
> --------------------------------------
>
>                 Key: ACCUMULO-3772
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3772
>             Project: Accumulo
>          Issue Type: Bug
>          Components: gc, tserver
>         Environment: SHA1 2436eec237dc23608e91cae56777a323e2f77295 (I believe). Running with Kerberos on. HDFS 2.7.1-SNAPSHOT
>            Reporter: Josh Elser
>            Assignee: Josh Elser
>            Priority: Blocker
>             Fix For: 1.7.0
>
>
> I have a 5-node test system which I ran all of the ITs against.
> When I came back to the cluster later, I observed the following:
> * Multiple MinCs against accumulo.metadata listed as running that weren't actually running
> * {{accumulo.root}} without an online tablet
> * Multiple queued FATE ops (PopulateMetadata blocked)
> Ultimately, I found that recovery was repeatedly failing (without any notification on the monitor, mind you).
> {noformat}
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,151 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: root tablet log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 21:44:16,152 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [master.Master] DEBUG: Root Table location State: +r<<@(null,null,os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal:9997[34d1e090f8c0004])
> 2015-05-04 21:44:16,152 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-4.novalocal+9997/fc0e8426-f295-40d7-888e-f4b7dde1f966 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/fc0e8426-f295-40d7-888e-f4b7dde1f966
> 2015-05-04 21:44:16,171 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-6.novalocal+9997/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/a1a2c22f-f90d-4542-ab24-3ebdfa5d6fe4
> 2015-05-04 21:44:16,174 [recovery.RecoveryManager] DEBUG: Recovering hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2 to hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/recovery/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Note *088e3c39-f19e-4f43-bfc6-82f978b8ccb2*.
> Eventually we see that the file is gone:
> {noformat}
> 2015-05-04 21:45:15,647 [recovery.RecoveryManager] DEBUG: Unable to initate log sort for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2: java.io.FileNotFoundException: File does not exist: /apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
>         at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLease(FSNamesystem.java:2828)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.recoverLease(NameNodeRpcServer.java:666)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.recoverLease(ClientNamenodeProtocolServerSideTranslatorPB.java:663)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
> {noformat}
> The most likely candidate to direct hatred towards is the GC:
> {noformat}
> [root@os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2 accumulo]# fgrep 088e3c39-f19e-4f43-bfc6-82f978b8ccb2 gc_os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal.debug.log
> 2015-05-04 08:30:15,210 [gc.GarbageCollectWriteAheadLogs] INFO : Checking replication table for hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,223 [gc.GarbageCollectWriteAheadLogs] DEBUG: WAL not needed for replication hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> 2015-05-04 08:30:15,268 [gc.GarbageCollectWriteAheadLogs] DEBUG: Removing unused WAL for server os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal:9997[14d1df69c910000] log hdfs://os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-2.novalocal:8020/apps/accumulo/data/wal/os-hdp-2-3-amb-r6-1430724014-sec-accumulo-1-5.novalocal+9997/088e3c39-f19e-4f43-bfc6-82f978b8ccb2
> {noformat}
> Sure enough, the GC deleted the file. I'll be digging from here...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)