systemml-issues mailing list archives

From "LI Guobao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2397) Paramserv ASP failing w/ OOM (too many threads)
Date Fri, 15 Jun 2018 15:43:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513991#comment-16513991 ]

LI Guobao commented on SYSTEMML-2397:
-------------------------------------

Afterwards, I checked the degree of parallelism of the instructions. It seems that all
of them are assigned 1 when the number of workers equals the number of vcores.
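
As a rough illustration of the arithmetic behind this check (not SystemML code), the sketch below compares the worst-case thread count when every worker instruction keeps a parallelism equal to the number of vcores against the case where it is capped per worker. The class and method names are hypothetical, and the rule max(1, vcores / workers) is only an assumed policy that happens to yield 1 when the number of workers equals the number of vcores.

```java
public class ParDegreeSketch {
    // Hypothetical policy (an assumption, not taken from SystemML):
    // threads each worker may use for its own instructions.
    static int perWorkerPar(int vcores, int workers) {
        return Math.max(1, vcores / workers);
    }

    // Worst-case number of concurrently running instruction threads
    // when every worker executes an instruction with the given parallelism.
    static long peakThreads(int workers, int parPerInstruction) {
        return (long) workers * parPerInstruction;
    }

    public static void main(String[] args) {
        int vcores = 80, workers = 80;
        // Instructions left at parallelism = vcores: 80 x 80 threads.
        System.out.println(peakThreads(workers, vcores));
        // Instructions capped by the assumed per-worker policy.
        System.out.println(peakThreads(workers, perWorkerPar(vcores, workers)));
    }
}
```

With 80 workers and 80 vcores this prints 6400 and then 80, which matches the 80x80 figure suspected in the issue description.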

> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
>                 Key: SYSTEMML-2397
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, and updates per EPOCH fails with an OOM despite
> a 200GB max heap. [~Guobao] could you please have a look? I suspect that the degree of
> parallelism of the instructions is not set correctly, leading to 80x80 concurrent threads.
> The easiest way to debug would be to apply {{Explain.explain}} to the worker instructions
> and check that every instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
> 	at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
> 	at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
> 	at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
> 	at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> 	... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ParamservBuiltinCPInstruction: some error occurred:
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	... 17 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
> 	... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:717)
> 	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> 	at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
> 	at org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
> 	at org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
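
The innermost cause in the trace above is a ThreadPoolExecutor failing to start another thread from inside a worker. The following minimal sketch, written against plain java.util.concurrent and not against SystemML's CommonThreadPool, shows how nested pools multiply thread creation. It assumes each worker creates its own fixed-size pool per operation, which may not be how SystemML manages threads; the class name NestedPoolSketch and the method threadsCreated are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NestedPoolSketch {
    // Runs `workers` tasks on an outer pool; each task fans out `par`
    // subtasks on its own inner pool (an assumed structure).
    // Returns the total number of threads created by both levels.
    static int threadsCreated(int workers, int par) throws Exception {
        AtomicInteger created = new AtomicInteger();
        // Thread factory that counts every thread it hands out.
        ThreadFactory counting = r -> {
            created.incrementAndGet();
            return new Thread(r);
        };
        ExecutorService outer = Executors.newFixedThreadPool(workers, counting);
        try {
            List<Callable<Integer>> workerTasks = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                workerTasks.add(() -> {
                    ExecutorService inner = Executors.newFixedThreadPool(par, counting);
                    try {
                        List<Callable<Integer>> ops = new ArrayList<>();
                        for (int i = 0; i < par; i++)
                            ops.add(() -> 1); // stand-in for one slice of an operation
                        inner.invokeAll(ops); // blocks until all slices finish
                        return par;
                    } finally {
                        inner.shutdown();
                    }
                });
            }
            outer.invokeAll(workerTasks); // blocks until all workers finish
        } finally {
            outer.shutdown();
        }
        return created.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(threadsCreated(8, 8)); // inner parallelism = number of workers
        System.out.println(threadsCreated(8, 1)); // inner parallelism capped at 1
    }
}
```

In this sketch the thread count grows as workers + workers x par, so 8 workers with an inner parallelism of 8 create 72 threads while an inner parallelism of 1 creates 16. The same pattern at 80 workers would request thousands of native threads, which is consistent with the "unable to create new native thread" error regardless of heap size.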



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
