systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2397) Paramserv ASP failing w/ OOM (too many threads)
Date Sat, 16 Jun 2018 02:07:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514624#comment-16514624 ]

Matthias Boehm commented on SYSTEMML-2397:
------------------------------------------

This patch, together with SYSTEMML-2400, fixed the issues. It now runs fine even with larger
batch sizes (the batch size internally limited the degree of parallelism of these operators).
The patch also significantly improved runtime performance by avoiding large
overprovisioning.

However, we should similarly restrict the instruction parallelism for aggregation, at
least in ASP and EPOCH mode, where every worker runs its own local aggregation.
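
To illustrate the thread arithmetic behind this, here is a minimal sketch, not the actual SystemML code. The class ParallelismBudget and both of its methods are hypothetical names, and the 80 available cores are an assumption inferred from the "80x80 concurrent threads" mentioned in the issue description. It only shows how a per-worker degree of parallelism could be derived so that workers times threads stays within the core budget.

```java
public final class ParallelismBudget {
    private ParallelismBudget() {}

    /** Threads each worker may use so that workers * threads stays within the core budget. */
    public static int perWorkerThreads(int availableCores, int numWorkers) {
        if (availableCores < 1 || numWorkers < 1)
            throw new IllegalArgumentException("cores and workers must be positive");
        // Integer division, but never below 1 so every worker can still make progress.
        return Math.max(1, availableCores / numWorkers);
    }

    /** Upper bound on concurrently running threads: workers times per-worker threads. */
    public static long totalThreads(int numWorkers, int threadsPerWorker) {
        return (long) numWorkers * threadsPerWorker;
    }

    public static void main(String[] args) {
        int cores = 80;   // assumption: one core per worker, as implied by "80x80"
        int workers = 80; // k=80 in the failing paramserv instruction

        // Every worker uses all cores for its instructions: 80 * 80 threads.
        System.out.println("unrestricted: " + totalThreads(workers, cores));

        // Every worker is limited to its share of the cores: 80 * 1 threads.
        int k = perWorkerThreads(cores, workers);
        System.out.println("restricted:   " + totalThreads(workers, k));
    }
}
```

With these assumed numbers the unrestricted case yields 6400 threads and the restricted case 80. The same budget would apply to the per-worker aggregation in ASP and EPOCH mode.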

> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
>                 Key: SYSTEMML-2397
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, and update per EPOCH fails with OOM despite a
> 200GB max heap. [~Guobao] could you please have a look? I suspect that the degree of parallelism
> of instructions is not set correctly, leading to 80x80 concurrent threads. The easiest way
> to debug would be to apply {{Explain.explain}} to the worker instructions and check that every
> instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
> 	at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
> 	at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
> 	at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
> 	at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> 	... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ParamservBuiltinCPInstruction: some error occurred:
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	... 17 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
> 	... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:717)
> 	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> 	at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
> 	at org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
> 	at org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
