accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (ACCUMULO-3575) Accumulo GC ran out of memory
Date Tue, 10 Feb 2015 21:10:12 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14314932#comment-14314932 ]

Keith Turner commented on ACCUMULO-3575:
----------------------------------------

Saw this problem while testing 1.6.2 RC4. The problem probably exists in 1.6.0 as well. In most
cases it can easily be worked around by increasing the Accumulo GC's memory. There could be a
worst-case scenario where an error condition creates too many walogs to ever fit into memory.
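
For reference, on a 1.6 install the workaround is a one-line change in conf/accumulo-env.sh (the
-Xmx value below is illustrative, not a tested recommendation):

{noformat}
# conf/accumulo-env.sh -- raise the heap available to the GC process
test -z "$ACCUMULO_GC_OPTS" && export ACCUMULO_GC_OPTS="-Xmx2g"
{noformat}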

> Accumulo GC ran out of memory
> -----------------------------
>
>                 Key: ACCUMULO-3575
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-3575
>             Project: Accumulo
>          Issue Type: Bug
>    Affects Versions: 1.6.0
>            Reporter: Keith Turner
>            Priority: Minor
>
> During a CI run (w/ agitation) on a 20-node EC2 cluster, the Accumulo GC died with the following errors.
> The following was in the GC .out file:
> {noformat}
> #
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 20970"...
> {noformat}
> The following was in the last lines of the .log file:
> {noformat}
> 2015-02-10 20:19:03,255 [gc.SimpleGarbageCollector] INFO : Collect cycle took 13.07 seconds
> 2015-02-10 20:19:03,258 [gc.SimpleGarbageCollector] INFO : Beginning garbage collection of write-ahead logs
> 2015-02-10 20:19:03,265 [zookeeper.ZooUtil] DEBUG: Trying to read instance id from hdfs://ip-10-1-2-11:9000/accumulo/instance_id
> {noformat}
> Restarted the GC and the same thing happened. Looked in the walog dir and saw there were 333k
> walogs. This is the problem: the GC tries to read the entire list of files into memory.
> {noformat}
> $ hadoop fs -ls -R /accumulo/wal | wc
> 15/02/10 20:31:35 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>  333053 2664424 43629314
> {noformat}
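> To make the memory problem concrete, here is a minimal sketch (not the GC's actual code) of the
> difference between materializing a listing and streaming it with the Hadoop FileSystem API; the
> path and counts come from the output above:
> {noformat}
> import java.io.IOException;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.LocatedFileStatus;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.fs.RemoteIterator;
>
> public class WalogListing {
>   public static void main(String[] args) throws IOException {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path walDir = new Path("/accumulo/wal");
>
>     // Materialized: listStatus returns a directory level as one array, and building
>     // a recursive listing this way keeps every FileStatus (path, times, permissions,
>     // ...) on the heap at once -- 333k of them in this incident.
>     FileStatus[] children = fs.listStatus(walDir);
>     System.out.println("entries held in memory at once: " + children.length);
>
>     // Streamed: listFiles walks the tree and hands back one entry at a time, so
>     // heap usage stays roughly constant no matter how many walogs exist.
>     RemoteIterator<LocatedFileStatus> it = fs.listFiles(walDir, true);
>     long zeroLength = 0;
>     while (it.hasNext()) {
>       if (it.next().getLen() == 0)
>         zeroLength++;
>     }
>     System.out.println("zero-length walogs: " + zeroLength);
>   }
> }
> {noformat}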
> I suspect the reason there were so many walogs is that there were many, many failures like the
> following, which resulted in zero-length walogs (only 199 of the 333k have non-zero length).
> The following error is from a tserver and is probably a result of killing datanodes.
> {noformat}
> 2015-02-10 03:45:00,447 [log.TabletServerLogger] ERROR: Unexpected error writing to log, retrying attempt 122
> java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only be replicated to 0 nodes instead of minReplication (=1).  There are 16 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.createLoggers(TabletServerLogger.java:190)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.access$300(TabletServerLogger.java:53)
>         at org.apache.accumulo.tserver.log.TabletServerLogger$1.withWriteLock(TabletServerLogger.java:148)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.testLockAndRun(TabletServerLogger.java:115)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.initializeLoggers(TabletServerLogger.java:137)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:245)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.write(TabletServerLogger.java:230)
>         at org.apache.accumulo.tserver.log.TabletServerLogger.log(TabletServerLogger.java:345)
>         at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.update(TabletServer.java:1817)
>         at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.accumulo.trace.instrument.thrift.RpcServerInvocationHandler.invoke(RpcServerInvocationHandler.java:46)
>         at org.apache.accumulo.server.util.RpcWrapper$1.invoke(RpcWrapper.java:47)
>         at com.sun.proxy.$Proxy22.update(Unknown Source)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2394)
>         at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$update.getResult(TabletClientService.java:2378)
>         at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>         at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>         at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:168)
>         at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:516)
>         at org.apache.accumulo.server.util.CustomNonBlockingServer$1.run(CustomNonBlockingServer.java:77)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
>         at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
>         at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /accumulo/wal/ip-10-1-2-21+9997/9906de55-bc93-47f4-887c-4b9540fc3528 could only be replicated to 0 nodes instead of minReplication (=1).  There are 16 datanode(s) running and no node(s) are excluded in this operation.
>         at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1549)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3200)
>         at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:641)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
>         at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:415)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1468)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1399)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
>         at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy21.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
>         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
> {noformat}
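> A hypothetical sketch of how such retries pile up zero-length walogs (this is not
> TabletServerLogger's actual code; names and backoff are illustrative): each attempt creates a
> fresh file in HDFS, and when no datanode can accept a block the empty file is abandoned rather
> than deleted.
> {noformat}
> import java.io.IOException;
> import java.util.UUID;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class RetryLeakSketch {
>   // Hypothetical create-per-retry loop, illustrating the failure mode only.
>   static void writeWithRetry(FileSystem fs, String server, byte[] entry)
>       throws InterruptedException {
>     int attempt = 0;
>     while (true) {
>       attempt++;
>       // The walog exists in the HDFS namespace (at length 0) as soon as create() runs.
>       Path walog = new Path("/accumulo/wal/" + server + "/" + UUID.randomUUID());
>       try (FSDataOutputStream out = fs.create(walog)) {
>         out.write(entry);   // fails while no datanode can accept the block
>         return;             // success: this walog has data
>       } catch (IOException e) {
>         // The zero-length file from this attempt is left in place and the next
>         // attempt creates another one -- 122 retries leave 122 empty walogs.
>         Thread.sleep(Math.min(attempt, 30) * 1000L);
>       }
>     }
>   }
> }
> {noformat}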
> Upped GC max mem from 256M to 2G and it ran ok.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
