From: Deron Eriksson
To: dev@systemml.incubator.apache.org
Subject: Re: Out of memory error and problem generating heap dump
Date: Tue, 17 Nov 2015 11:10:23 -0800

Hello,

Thank you for the help, Matthias.
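
To close the loop, here is a rough sketch of the kind of mapred-site.xml entries involved in the fix (inside the <configuration> element). The 3072 MB figures are only illustrative, following Matthias's ~1.5x-of-max-heap rule of thumb for 2 GB task heaps, and are not necessarily the exact values on my cluster:

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>3072</value>   <!-- illustrative: roughly 1.5x a -Xmx2g map task heap -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>   <!-- illustrative: roughly 1.5x a -Xmx2g reduce task heap -->
  </property>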

Explicitly bumping up "mapreduce.map.memory.mb" and "mapreduce.reduce.memory.mb" in mapred-site.xml took care of the memory issues that I had been hitting with Kmeans in Hadoop batch mode.

Deron

On Mon, Nov 16, 2015 at 4:24 PM, wrote:

> Well, I think you're on the right track, but your cluster configuration still has a couple of issues.
>
> The error tells us that you're not actually running out of memory; your tasks are killed by the node managers because you are exceeding the allocated virtual container memory. So here are a couple of things to check:
>
> 1) Consistent container configuration: You already modified the JVM options for the map/reduce tasks (e.g., mapreduce.map.java.opts). View them as configurations of your actual processes. In addition, you have to ensure that you request consistent container resources for these tasks. Please double-check 'mapreduce.map.memory.mb' and 'mapreduce.reduce.memory.mb' in mapred-site.xml (the MapReduce AM requests container resources according to these configurations, which also need to cover JVM overheads). I usually configure them conservatively to 1.5x the max heap configuration of my tasks.
>
> 2) Virtual memory configuration: Also, please ensure that you allow a sufficiently large ratio between allocated virtual and physical memory. Overcommitting virtual memory is fine. Please check the following property in yarn-site.xml: 'yarn.nodemanager.vmem-pmem-ratio'. I usually configure this to something between 2 and 5. If this does not solve your problem, you can also prevent your task processes from being killed in these situations by setting 'yarn.nodemanager.vmem-check-enabled' to false.
>
> Regards,
> Matthias
>
>
> From: Deron Eriksson
> To: dev@systemml.incubator.apache.org
> Date: 11/16/2015 02:58 PM
> Subject: Re: Out of memory error and problem generating heap dump
>
> Hello Matthias,
>
> Thank you for the help! I'm still running into issues, so I was wondering if you have any further guidance. My main question is whether I am setting the memory and garbage collection options in the right place, since it's a multi-node, multi-JVM environment.
>
> With regard to your point (1):
> I updated my mapred-site.xml mapreduce.map.java.opts property to "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and my mapreduce.reduce.java.opts property to "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".
>
> When I reran Kmeans.dml with the 1M-row matrix, I hit the same errors at this point, but the log messages now provided more useful information. The physical memory appeared to be fine, but the virtual memory had an issue:
>
> 15/11/16 10:36:05 INFO mapreduce.Job: Task Id : attempt_1447698794207_0001_m_000015_2, Status : FAILED
> Container [pid=63900,containerID=container_1447698794207_0001_01_000072] is running beyond virtual memory limits. Current usage: 165.6 MB of 1 GB physical memory used; 3.7 GB of 2.1 GB virtual memory used. Killing container.
>
> Next I looked at points (2) and (3):
> I updated mapreduce.map.java.opts to "-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and mapreduce.reduce.java.opts to "-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".
>
> This resulted in the same errors as before.
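>
> For concreteness, here is roughly how those two mapred-site.xml entries read after that change (a sketch of the settings just described, inside the <configuration> element):
>
>   <property>
>     <name>mapreduce.map.java.opts</name>
>     <!-- -server, fixed 2 GB heap, 200 MB young generation, plus the heap-dump options -->
>     <value>-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.java.opts</name>
>     <value>-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/</value>
>   </property>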
>
> Am I setting the memory and garbage collection options in the right place, for the right JVMs?
>
> Each node has 12 GB RAM and about 60 GB (of 144 GB) free HD space.
>
> Thanks!
> Deron
>
>
> On Fri, Nov 13, 2015 at 4:17 PM, wrote:
>
> > Hi Deron,
> >
> > A couple of things to try out:
> >
> > 1) Task configuration: Please double-check your configuration; if the errors are coming from the individual map/reduce tasks, please change 'mapreduce.map.java.opts' and 'mapreduce.reduce.java.opts' in your mapred-site.xml. The name node / data node configurations don't have any effect on the actual tasks.
> >
> > 2) Recommended memory config: Normally, we recommend a configuration of -Xmx2g -Xms2g -Xmn200m for map/reduce tasks (if this still allows a task per core), with an io.sort.mb of 384 MB for an HDFS block size of 128 MB. Note the -Xmn parameter, which fixes the size of the young generation; this size also affects additional memory overheads. If it is set to 10% of your max heap, we guarantee that your tasks will not run out of memory.
> >
> > 3) GC overhead: You're not getting the OOM because you actually ran out of memory but because you spent too much time on garbage collection (because you are close to the memory limits). If you're running OpenJDK, it's usually a good idea to specify the '-server' flag. If this does not help, you might want to increase the number of threads for garbage collection.
> >
> > 4) Explain w/ memory estimates: Finally, there is always the possibility of bugs too. If the configuration changes above do not solve the problem, please run it with "-explain recompile_hops" and subsequently "-explain recompile_runtime", which will give you the memory estimates. Things to look for are broadcast-based operators where the size of the vectors exceeds the budgets of your tasks, and instructions that generate large outputs.
> >
> >
> > Regards,
> > Matthias
> >
> >
> > From: Deron Eriksson
> > To: dev@systemml.incubator.apache.org
> > Date: 11/13/2015 03:25 PM
> > Subject: Out of memory error and problem generating heap dump
> >
> > Hello,
> >
> > I'm running into an out-of-memory issue when I attempt to use the Kmeans.dml algorithm on a 1M-row matrix of generated test data. I am trying to generate a heap dump in order to help diagnose the problem, but so far I haven't been able to correctly generate a heap dump file. I was wondering if anyone has any advice regarding the out-of-memory issue and creating a heap dump to help diagnose the problem.
> >
> > I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop batch mode. The master node has NameNode, SecondaryNameNode, and ResourceManager daemons running on it. The 3 other nodes have DataNode and NodeManager daemons running on them.
> >
> > I'm trying out the Kmeans.dml algorithm. To begin, I generated test data using the genRandData4Kmeans.dml script with 100K rows via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx
> >
> > Next, I ran Kmeans.dml against the Xsmall.mtx 100K-row matrix via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5
> >
> > This ran perfectly.
> >
> > However, next I increased the amount of test data to 1M rows, which resulted in matrix data of about 3GB in size:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx Y=Y.mtx YbyC=YbyC.mtx
> >
> > I ran Kmeans.dml against the 1M-row X.mtx matrix via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
> >
> > In my console, I received a number of error messages such as:
> >
> > Error: Java heap space
> > 15/11/13 14:48:58 INFO mapreduce.Job: Task Id : attempt_1447452404596_0006_m_000023_1, Status : FAILED
> > Error: GC overhead limit exceeded
> >
> > Next, I attempted to generate a heap dump. Additionally, I added some settings so that I could look at memory usage remotely using JConsole.
> >
> > I added the following lines to my hadoop-env.sh files on each node:
> >
> > export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"
> >
> > export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"
> >
> > I added the following to my yarn-env.sh files on each node:
> >
> > export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_RESOURCEMANAGER_OPTS}"
> >
> > export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"
> >
> > Additionally, I modified the bin/hadoop file:
> >
> > HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps/ -Dcom.sun.management.jmxremote.port=9997 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false"
> >
> > I was able to look at my Java processes remotely in real-time using JConsole. I did not see where the out-of-memory error was happening.
> >
> > Next, I examined the error logs on the 4 nodes. I searched for FATAL entries with the following:
> >
> > $ pwd
> > /home/hadoop2/hadoop-2.6.2/logs
> > $ grep -R FATAL *
> >
> > On the slave nodes, I had log messages such as the following, which seem to indicate the error occurred for the YARN process (NodeManager):
> >
> > userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12 17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > Does anyone have any advice regarding what is causing this error or how I can go about generating a heap dump so I can help diagnose the issue?
> >
> > Thank you,
> >
> > Deron
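
P.S. For anyone finding this thread in the archives: the yarn-site.xml properties Matthias mentions under point 2 above take the following form. This is only a sketch; the ratio value of 4 is illustrative (somewhere in the 2-5 range he suggests), and disabling the vmem check is the fallback rather than the first resort:

  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>   <!-- illustrative: allow virtual memory up to 4x the physical container size -->
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>   <!-- fallback only: stops the NodeManager from killing containers on vmem overuse -->
  </property>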