hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Li Cheng (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions
Date Fri, 27 Sep 2019 10:14:00 GMT

    [ https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16939296#comment-16939296
] 

Li Cheng commented on HDDS-2186:
--------------------------------

[~ljain] After some investigation, it turned out MiniOzoneCluster is abusing resources to
create pipelines. Reason it didn't have problems before is that every datanode could only
be assigned to one pipeline so that the quota runs out fast. Now the limit is taken off and
there is no virtual limit to prevent cluster from creating pipelines other than ratis says
resource like memory is not enough. I'm adding logic to prevent this, but unfortunately, factor
ONE and factor THREE pipelines need to be handled differently, the logic grows more and more
complex. 

> Fix tests using MiniOzoneCluster for its memory related exceptions
> ------------------------------------------------------------------
>
>                 Key: HDDS-2186
>                 URL: https://issues.apache.org/jira/browse/HDDS-2186
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>    Affects Versions: HDDS-1564
>            Reporter: Li Cheng
>            Priority: Major
>              Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a bunch of 'out
of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
ERROR segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:run(323)) - 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
hit exception2019-09-26 15:12:22,824 [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
ERROR segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:run(323)) - 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:694)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$null$2(181))
- Failed invoke Ratis rpc org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR
pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke
Ratis rpc org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
for c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException: deadline
exceeded after 2999881264ns at org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82)
at org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
at org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) at org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
at org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) at org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
at java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386) at
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)Caused
by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline
exceeded after 2999881264ns at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:233)
at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:214)
at org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:139)
at org.apache.ratis.proto.grpc.AdminProtocolServiceGrpc$AdminProtocolServiceBlockingStub.groupManagement(AdminProtocolServiceGrpc.java:274)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.lambda$groupAdd$3(GrpcClientProtocolClient.java:149)
at org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:176)
... 25 more



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org


Mime
View raw message