Date: Sat, 16 Jun 2018 02:07:00 +0000 (UTC)
From: "Matthias Boehm (JIRA)"
To: issues@systemml.apache.org
Subject: [jira] [Commented] (SYSTEMML-2397) Paramserv ASP failing w/ OOM (too many threads)

    [ https://issues.apache.org/jira/browse/SYSTEMML-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16514624#comment-16514624 ]

Matthias Boehm commented on SYSTEMML-2397:
------------------------------------------

This patch together with SYSTEMML-2400 fixed the issues. Now it runs fine even with larger batch sizes (the batch size internally limited the degree of parallelism of these operators). Furthermore, this patch also significantly improved the runtime performance by avoiding large overprovisioning.

However, similarly, we should also restrict the instruction parallelism for aggregation, at least in ASP and EPOCH mode where every worker runs their local aggregation.

> Paramserv ASP failing w/ OOM (too many threads)
> -----------------------------------------------
>
>                 Key: SYSTEMML-2397
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-2397
>             Project: SystemML
>          Issue Type: Bug
>            Reporter: Matthias Boehm
>            Assignee: LI Guobao
>            Priority: Major
>
> Paramserv ASP with 2 epochs, 80 workers, update per EPOCH failing due to OOM despite 200GB max heap. [~Guobao] could you please have a look?
> I suspect that the degree of parallelism of instructions is not set correctly, leading to 80x80 concurrent threads. The easiest way to debug would be to apply {{Explain.explain}} to the worker instructions and check that every instruction has an assigned degree of parallelism of 1.
> {code}
> 2018-06-14 22:31:16 ERROR DMLScript:543 - Failed to execute DML script.
> org.apache.sysml.runtime.DMLRuntimeException: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
>     at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:123)
>     at org.apache.sysml.api.ScriptExecutorUtils.executeRuntimeProgram(ScriptExecutorUtils.java:100)
>     at org.apache.sysml.api.DMLScript.execute(DMLScript.java:746)
>     at org.apache.sysml.api.DMLScript.executeScript(DMLScript.java:517)
>     at org.apache.sysml.api.DMLScript.main(DMLScript.java:248)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>     at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
>     at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
>     at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
>     at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
>     at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ERROR: Runtime error in program block generated from statement block between lines 0 and 71 -- Error evaluating instruction: CP°paramserv°agg=./mnist_lenet_paramserv.dml::aggregation°checkpointing=NONE°scheme=DISJOINT_CONTIGUOUS°hyperparams=_Var824°upd=./mnist_lenet_paramserv.dml::gradients°utype=ASP°freq=EPOCH°k=80°val_features=_mVar823°batchsize=64°labels=_mVar825°mode=LOCAL°features=_mVar826°model=_Var844°val_labels=_mVar819°epochs=2°_Var845·LIST·UNKNOWN
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:282)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
>     at org.apache.sysml.runtime.controlprogram.Program.execute(Program.java:116)
>     ... 14 more
> Caused by: org.apache.sysml.runtime.DMLRuntimeException: ParamservBuiltinCPInstruction: some error occurred:
>     at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:163)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
>     ... 17 more
> Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: unable to create new native thread
>     at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>     at org.apache.sysml.runtime.instructions.cp.ParamservBuiltinCPInstruction.processInstruction(ParamservBuiltinCPInstruction.java:158)
>     ... 18 more
> Caused by: java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:717)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>     at java.util.concurrent.AbstractExecutorService.invokeAll(AbstractExecutorService.java:238)
>     at org.apache.sysml.runtime.util.CommonThreadPool.invokeAll(CommonThreadPool.java:76)
>     at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.execute(LibMatrixDNN.java:755)
>     at org.apache.sysml.runtime.matrix.data.LibMatrixDNN.reluBackward(LibMatrixDNN.java:284)
>     at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processReluBackwardInstruction(ConvolutionCPInstruction.java:298)
>     at org.apache.sysml.runtime.instructions.cp.ConvolutionCPInstruction.processInstruction(ConvolutionCPInstruction.java:465)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeSingleInstruction(ProgramBlock.java:252)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.executeInstructions(ProgramBlock.java:210)
>     at org.apache.sysml.runtime.controlprogram.ProgramBlock.execute(ProgramBlock.java:161)
>     at org.apache.sysml.runtime.controlprogram.FunctionProgramBlock.execute(FunctionProgramBlock.java:116)
>     at org.apache.sysml.runtime.instructions.cp.FunctionCallCPInstruction.processInstruction(FunctionCallCPInstruction.java:152)
>     at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeGradients(LocalPSWorker.java:170)
>     at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.computeEpoch(LocalPSWorker.java:79)
>     at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:58)
>     at org.apache.sysml.runtime.controlprogram.paramserv.LocalPSWorker.call(LocalPSWorker.java:35)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
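
For readers who want to see the suspected "80x80 concurrent threads" effect in isolation, the following is a minimal sketch and not SystemML code. It assumes that each worker runs one multi-threaded instruction on its own inner pool of size k, roughly as the trace's path from LocalPSWorker through LibMatrixDNN.execute to CommonThreadPool.invokeAll suggests. The class name NestedParallelismSketch and the method countInnerThreads are hypothetical, and only java.util.concurrent is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: none of these names exist in SystemML.
public class NestedParallelismSketch {

    /**
     * Starts `workers` outer tasks. Each one runs a single "instruction"
     * that splits its work into `k` chunks on its own inner pool of size k.
     * Returns how many inner threads were created across all workers.
     */
    public static int countInnerThreads(int workers, int k) {
        AtomicInteger created = new AtomicInteger();
        // Counts every thread the inner pools ask for.
        ThreadFactory countingFactory = r -> {
            created.incrementAndGet();
            return new Thread(r);
        };

        ExecutorService outer = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Void>> workerTasks = new ArrayList<>();
            for (int w = 0; w < workers; w++) {
                workerTasks.add(() -> {
                    ExecutorService inner = Executors.newFixedThreadPool(k, countingFactory);
                    try {
                        List<Callable<Void>> chunks = new ArrayList<>();
                        for (int i = 0; i < k; i++) {
                            chunks.add(() -> null); // stand-in for one chunk of operator work
                        }
                        for (Future<Void> f : inner.invokeAll(chunks)) {
                            f.get();
                        }
                    } finally {
                        inner.shutdown();
                    }
                    return null;
                });
            }
            for (Future<Void> f : outer.invokeAll(workerTasks)) {
                f.get();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            outer.shutdown();
        }
        return created.get();
    }

    public static void main(String[] args) {
        // Inner parallelism equal to the worker count: quadratic thread growth.
        System.out.println(countInnerThreads(8, 8)); // prints 64
        // Inner parallelism restricted to 1: one inner thread per worker.
        System.out.println(countInnerThreads(8, 1)); // prints 8
    }
}
```

With 80 workers and k=80 the same arithmetic gives 6400 inner threads. That is consistent with "unable to create new native thread" being raised despite the large heap, since that error reflects a limit on native threads rather than exhausted heap space.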