Return-Path: X-Original-To: apmail-accumulo-notifications-archive@minotaur.apache.org Delivered-To: apmail-accumulo-notifications-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 480CCE597 for ; Tue, 8 Jan 2013 00:00:13 +0000 (UTC) Received: (qmail 10742 invoked by uid 500); 8 Jan 2013 00:00:13 -0000 Delivered-To: apmail-accumulo-notifications-archive@accumulo.apache.org Received: (qmail 10708 invoked by uid 500); 8 Jan 2013 00:00:13 -0000 Mailing-List: contact notifications-help@accumulo.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: jira@apache.org Delivered-To: mailing list notifications@accumulo.apache.org Received: (qmail 10696 invoked by uid 99); 8 Jan 2013 00:00:13 -0000 Received: from arcas.apache.org (HELO arcas.apache.org) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 08 Jan 2013 00:00:13 +0000 Date: Tue, 8 Jan 2013 00:00:12 +0000 (UTC) From: "John Vines (JIRA)" To: notifications@accumulo.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (ACCUMULO-939) WAL get stuck for unknown reasons MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/ACCUMULO-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13546425#comment-13546425 ] John Vines commented on ACCUMULO-939: ------------------------------------- Dropped memory map to 3G, tweaked ingest to {code}./accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 1024 -z 172.16.101.220:2181 --batchMemory 10000000 --batchThreads 5 --createTable {code} (halved batch writers, dropped an order of magnitude from batch memory, and dropped size to 1K from 8k. Ran once, no issues. Ran again after minor compaction and I had 24 log entries of the similar entries. This is the only different one, I think- {code}Unexpected error writing to log, retrying attempt 1 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/7a529101-6392-46b0-b59b-d3bcc4a72a9d could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) at org.apache.hadoop.ipc.Client.call(Client.java:1070) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225) at $Proxy1.addBlock(Unknown Source) at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3510) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3373) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2589) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2829){code} This had 23 occurrences while the other 23 error messages had 2 each. No repetitive {code}Unexpected error writing to log, retrying attempt 1 java.io.IOException: DFSOutputStream is closed at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3669) at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97) at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.logManyTablets(TabletServerLogger.java:362) at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.flush(TabletServer.java:1563) at org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler.closeUpdate(TabletServer.java:1634) at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63) at $Proxy2.closeUpdate(Unknown Source) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2018) at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$closeUpdate.getResult(TabletClientService.java:2002) at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:156) at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478) at org.apache.thrift.server.Invocation.run(Invocation.java:18) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at org.apache.accumulo.cloudtrace.instrument.TraceRunnable.run(TraceRunnable.java:47) at org.apache.accumulo.core.util.LoggingRunnable.run(Log{code} though. Doing this same ingest with a size of 1536 is enough to break it still. Still running hadoop 1.0.4. Ingest DID complete though. > WAL get stuck for unknown reasons > --------------------------------- > > Key: ACCUMULO-939 > URL: https://issues.apache.org/jira/browse/ACCUMULO-939 > Project: Accumulo > Issue Type: Bug > Components: tserver > Affects Versions: 1.5.0 > Environment: hadoop-1.0.1 > Reporter: John Vines > Assignee: Eric Newton > Attachments: tserver_jstack > > > Attempting to test ACCUMULO-575 with the following test framework: > Test bench- > 1 node running hadoop namenode and 1 datanode > slave noderunning 1 datanode and accumulo stack, with 8GB in memory map > Running patched version of accumulo with the following aptch to provide helper debug > {code}Index: server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java > =================================================================== > --- server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java (revision 1429057) > +++ server/src/main/java/org/apache/accumulo/server/tabletserver/Compactor.java (working copy) > @@ -81,6 +81,7 @@ > private FileSystem fs; > protected KeyExtent extent; > private List iterators; > + protected boolean minor= false; > > Compactor(Configuration conf, FileSystem fs, Map files, InMemoryMap imm, String outputFile, boolean propogateDeletes, > TableConfiguration acuTableConf, KeyExtent extent, CompactionEnv env, List iterators) { > @@ -158,7 +159,7 @@ > log.error("Verification of successful compaction fails!!! " + extent + " " + outputFile, ex); > throw ex; > } > - > + log.info("Just completed minor? " + minor + " for table " + extent.getTableId()); > log.debug(String.format("Compaction %s %,d read | %,d written | %,6d entries/sec | %6.3f secs", extent, majCStats.getEntriesRead(), > majCStats.getEntriesWritten(), (int) (majCStats.getEntriesRead() / ((t2 - t1) / 1000.0)), (t2 - t1) / 1000.0)); > > Index: server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java > =================================================================== > --- server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java (revision 1429057) > +++ server/src/main/java/org/apache/accumulo/server/tabletserver/MinorCompactor.java (working copy) > @@ -88,6 +88,7 @@ > > do { > try { > + this.minor = true; > CompactionStats ret = super.call(); > > // log.debug(String.format("MinC %,d recs in | %,d recs out | %,d recs/sec | %6.3f secs | %,d bytes ",map.size(), entriesCompacted, > {code} > I stood up a new instance, create a table named test. Ran the following - > {code}tail -f accumulo-1.5.0-SNAPSHOT/logs/tserver_slave.debug.log | ./ifttt.sh {code} > where ifttt.sh is > {code} #!/bin/sh > dnpid=`jps -m | grep DataNode | awk '{print $1}'` > while [ -z "" ]; do > if [ -e $1 ] ;then read str; else str=$1;fi > if [ -n "`echo $str | grep "Just completed minor? true for table 2"`" ]; then > echo "I'm gonna kill datanode, pid $dnpid" > kill -9 $dnpid > fi > done > {code} > Then I ran thefollowing > {code}accumulo org.apache.accumulo.server.test.TestIngest --table test --rows 65536 --cols 100 --size 8192 -z 172.16.101.220:2181 --batchMemory 100000000 --batchThreads 10 {code} > Eventually the memory map filled, minor compaction happened, local datanode was killed and things died. Logs filled with- > {code} org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /accumulo/wal/172.16.101.219+9997/08b9f1b4-26d5-4b07-a260-3334c2013576 could only be replicated to 0 nodes, instead of 1 > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1556) > at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:616) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:416) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382) > {code} > and > {code} > Unexpected error writing to log, retrying attempt 1 > java.io.IOException: DFSOutputStream is closed > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3666) > at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97) > at org.apache.accumulo.server.tabletserver.log.DfsLogger.defineTablet(DfsLogger.java:295) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger$4.write(TabletServerLogger.java:333) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:273) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:254) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.write(TabletServerLogger.java:229) > at org.apache.accumulo.server.tabletserver.log.TabletServerLogger.defineTablet(TabletServerLogger.java:330) > ... repeats... > {code}. > Bringing the datanode back up did NOT fix it, either. > UPDATE: reran and never killed datanode and it still died. So this isn't an issue with my datanode killing, it's something with hadop 1.0.1 and the new rite ahead logs. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira