Date: Wed, 6 May 2015 15:02:00 +0000 (UTC)
From: "Keith Turner (JIRA)"
To: notifications@accumulo.apache.org
Subject: [jira] [Commented] (ACCUMULO-3775) Root tablet had 6,974 walogs

    [ https://issues.apache.org/jira/browse/ACCUMULO-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530669#comment-14530669 ]

Keith Turner commented on ACCUMULO-3775:
----------------------------------------

I think when a walog is opened, the header is written but it is not synced. We could have the new background thread that opens logs before they are needed try to sync the log (writing out the header). If this sync fails, it could delete the walog and retry until a sync succeeds. This should prevent large numbers of walogs that cannot be written to from being added to ZooKeeper or the metadata table.

I can try making this change this afternoon, if no one has any issues with this approach. (A rough sketch of the idea follows the quoted log excerpts below.)

> Root tablet had 6,974 walogs
> ----------------------------
>
>                 Key: ACCUMULO-3775
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3775
>             Project: Accumulo
>          Issue Type: Bug
>         Environment: Same as ACCUMULO-3774
>            Reporter: Keith Turner
>            Assignee: Eric Newton
>            Priority: Blocker
>             Fix For: 1.7.0
>
>         Attachments: ACCUMULO_3775-01.patch
>
>
> Before the deadlock described in ACCUMULO-3774, the root tablet recovered 6,974 walogs. Almost all of these were empty. Before the tserver was killed, there were thousands of messages like the following (I think this was caused by datanode agitation).
> {noformat}
> 2015-05-05 18:02:43,236 [log.TabletServerLogger] INFO : Using next log hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c
> 2015-05-05 18:02:43,236 [log.TabletServerLogger] DEBUG: Creating next WAL
> 2015-05-05 18:02:43,236 [tserver.TabletServer] INFO : Writing log marker for level ROOT hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c
> 2015-05-05 18:02:43,236 [log.DfsLogger] DEBUG: Address is worker10:9997
> 2015-05-05 18:02:43,236 [log.DfsLogger] DEBUG: DfsLogger.open() begin
> 2015-05-05 18:02:43,236 [util.MetadataTableUtil] DEBUG: Adding log entry hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c
> 2015-05-05 18:02:43,237 [fs.VolumeManagerImpl] DEBUG: creating hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1 with CreateFlag set: [CREATE, SYNC_BLOCK]
> 2015-05-05 18:02:43,246 [tserver.TabletServer] INFO : Writing log marker for level NORMAL hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c
> 2015-05-05 18:02:43,247 [util.MetadataTableUtil] DEBUG: Adding log entry hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c
> 2015-05-05 18:02:43,247 [log.DfsLogger] DEBUG: No enciphering, using raw output stream
> 2015-05-05 18:02:43,247 [log.DfsLogger] DEBUG: Got new write-ahead log: worker10:9997/hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1
> 2015-05-05 18:02:43,250 [hdfs.DFSClient] WARN : DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/worker10+9997/a13aee79-c313-4298-b55a-8ec58ffb977c could only be replicated to 2 nodes instead of minReplication (=3). There are 16 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3067)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:722)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
>         at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy16.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> {noformat}
> {noformat}
> 2015-05-05 18:02:43,352 [log.TabletServerLogger] INFO : Using next log hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1
> 2015-05-05 18:02:43,352 [log.TabletServerLogger] DEBUG: Creating next WAL
> 2015-05-05 18:02:43,352 [tserver.TabletServer] INFO : Writing log marker for level ROOT hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1
> 2015-05-05 18:02:43,352 [log.DfsLogger] DEBUG: Address is worker10:9997
> 2015-05-05 18:02:43,352 [log.DfsLogger] DEBUG: DfsLogger.open() begin
> 2015-05-05 18:02:43,353 [util.MetadataTableUtil] DEBUG: Adding log entry hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1
> 2015-05-05 18:02:43,353 [fs.VolumeManagerImpl] DEBUG: creating hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/1810b018-26e3-4728-bbab-e3d901e3edd3 with CreateFlag set: [CREATE, SYNC_BLOCK]
> 2015-05-05 18:02:43,362 [log.DfsLogger] DEBUG: No enciphering, using raw output stream
> 2015-05-05 18:02:43,362 [log.DfsLogger] DEBUG: Got new write-ahead log: worker10:9997/hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/1810b018-26e3-4728-bbab-e3d901e3edd3
> 2015-05-05 18:02:43,366 [log.TabletServerLogger] DEBUG: Created next WAL hdfs://10.1.5.21:10000/accumulo/wal/worker10+9997/1810b018-26e3-4728-bbab-e3d901e3edd3
> 2015-05-05 18:02:43,366 [hdfs.DFSClient] WARN : DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/worker10+9997/295244ee-c9e3-404f-a3d8-9569e41ba8e1 could only be replicated to 2 nodes instead of minReplication (=3). There are 16 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1550)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3067)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:722)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1476)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:418)
>         at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy16.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1430)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1226)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
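
To make the proposal in the comment concrete, here is a minimal sketch of the create/sync/retry loop it describes, written against the plain Hadoop FileSystem API. It is illustrative only: the class and method names (NextWalogCreator, createNextWalog) are invented for this sketch and are not the actual DfsLogger/TabletServerLogger code, and details such as the retry back-off are assumptions.

{noformat}
import java.io.IOException;
import java.util.UUID;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Hypothetical helper for the background thread that opens walogs ahead of
 * time: create the file, write the header, and hsync it. If the sync fails
 * (e.g. the block cannot be replicated), delete the file and try a new one,
 * so an unwritable walog is never handed out and therefore never added to
 * ZooKeeper or the metadata table.
 */
public class NextWalogCreator {

  private final FileSystem fs;
  private final Path walDir;
  private final byte[] header;

  public NextWalogCreator(FileSystem fs, Path walDir, byte[] header) {
    this.fs = fs;
    this.walDir = walDir;
    this.header = header;
  }

  /** Retries until a walog with a successfully synced header is available. */
  public FSDataOutputStream createNextWalog() throws InterruptedException {
    while (true) {
      Path candidate = new Path(walDir, UUID.randomUUID().toString());
      FSDataOutputStream out = null;
      try {
        out = fs.create(candidate);
        out.write(header);
        // Force the header out to the datanodes; an under-replicated
        // pipeline fails here, before the log has been advertised anywhere.
        out.hsync();
        return out;
      } catch (IOException e) {
        // Create or sync failed: close and delete the unusable file, then retry.
        if (out != null) {
          try { out.close(); } catch (IOException ignored) { }
        }
        try {
          fs.delete(candidate, false);
        } catch (IOException ignored) {
          // best-effort cleanup
        }
        Thread.sleep(1000); // assumed back-off before trying a new file
      }
    }
  }
}
{noformat}

Only after createNextWalog() returns would the log markers be written to ZooKeeper and the metadata table, which is what should keep thousands of empty, unwritable walogs out of the root tablet's log list.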