systemml-issues mailing list archives

From "Matthias Boehm (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SYSTEMML-2397) Paramserv ASP failing w/ OOM (too many threads)
Date Fri, 15 Jun 2018 15:47:00 GMT

    [ https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513997#comment-16513997 ]

Matthias Boehm commented on SYSTEMML-2397:
------------------------------------------

Ok, just to make sure we are on the same page:
* Instruction parallelism: with 80 vcores and 80 workers, we need to set the parallelism
per instruction to 1; with 40 workers, to 2; and so on (vcores divided by the number of workers).
* To avoid these issues, try to use the parfor function copy as is. It decides the names
and places the copies into the program accordingly, so you only need to adopt the parfor
naming scheme for functions. This also ensures there are no side effects between different
workers (which might cause the accuracy issues).
* Before calling the function copy, please set the degree of parallelism on the hops of the
original program and recompile it to instructions.
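
The arithmetic behind the first point can be sketched as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not part of the SystemML API, and the thread counts are rough upper bounds (workers times threads per instruction), not measurements.

```java
public class InstructionParallelismSketch {

    // Hypothetical helper (not a SystemML API): number of threads each
    // worker may use per instruction so that workers * k stays near vcores.
    static int perInstructionParallelism(int vcores, int workers) {
        if (vcores < 1 || workers < 1) {
            throw new IllegalArgumentException("vcores and workers must be positive");
        }
        // Integer division, floored at 1 so every worker keeps one thread.
        return Math.max(1, vcores / workers);
    }

    // Rough upper bound on concurrently running threads.
    static int peakThreads(int workers, int threadsPerInstruction) {
        return workers * threadsPerInstruction;
    }

    public static void main(String[] args) {
        int vcores = 80;

        // Suspected failure mode: every worker instruction uses all vcores.
        System.out.println("unset: 80 workers x 80 threads = " + peakThreads(80, vcores));

        // Intended setting: divide the vcores among the workers.
        for (int workers : new int[] {80, 40, 20}) {
            int k = perInstructionParallelism(vcores, workers);
            System.out.println(workers + " workers: k=" + k
                    + ", peak threads=" + peakThreads(workers, k));
        }
    }
}
```

With the parallelism left unset, the bound is 80 x 80 = 6400 threads, which matches the suspected cause of the "unable to create new native thread" error; with the divided setting it stays at 80 for each of the worker counts shown.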

> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
>                 Key: SYSTEMML-2397
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, update per EPOCH failing due to OOM despite
> 200GB max heap. [~Guobao] could you please have a look? I suspect that the degree of parallelism
> of instructions is not set correctly, leading to 80x80 concurrent threads. The easiest way
> to debug would be to apply {{Explain.explain}} to the worker instructions and check that every
> instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
> 	at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
> 	at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
> 	at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
> 	at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> 	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> 	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> 	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> 	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> 	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
> 	... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ParamservBuiltinCPInstruction: some error occurred:
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	... 17 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> 	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
> 	at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
> 	... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:717)
> 	at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> 	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> 	at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
> 	at org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
> 	at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
> 	at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
> 	at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
> 	at org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
> 	at org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
> 	at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
