From: Deron Eriksson
To: dev@systemml.incubator.apache.org
Subject: Re: Out of memory error and problem generating heap dump
Date: Tue, 17 Nov 2015 11:10:23 -0800

Hello,

Thank you for the help, Matthias.
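
To close the loop, here is a rough sketch of the kind of mapred-site.xml entries involved in the fix (inside the <configuration> element). The 3072 MB figures are only illustrative, following Matthias's ~1.5x-of-max-heap rule of thumb for 2 GB task heaps, and are not necessarily the exact values on my cluster:

  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>3072</value>   <!-- illustrative: roughly 1.5x a -Xmx2g map task heap -->
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>3072</value>   <!-- illustrative: roughly 1.5x a -Xmx2g reduce task heap -->
  </property>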

Explicitly bumping up "mapreduce.map.memory.mb" and "mapreduce.reduce.memory.mb" in mapred-site.xml took care of the memory issues that I had been hitting with Kmeans in Hadoop batch mode.

Deron

On Mon, Nov 16, 2015 at 4:24 PM, wrote:

> Well, I think you're on the right track, but your cluster configuration still has a couple of issues.
>
> The error tells us that you're not actually running out of memory; your tasks are killed by the node managers because you are exceeding the allocated virtual container memory. So here are a couple of things to check:
>
> 1) Consistent container configuration: You already modified the JVM options for the map/reduce tasks (e.g., mapreduce.map.java.opts). View them as configurations of your actual processes. In addition, you have to ensure that you request consistent container resources for these tasks. Please double-check 'mapreduce.map.memory.mb' and 'mapreduce.reduce.memory.mb' in mapred-site.xml (the MapReduce AM requests container resources according to these configurations, which also need to cover JVM overheads). I usually configure them conservatively to 1.5x the max heap configuration of my tasks.
>
> 2) Virtual memory configuration: Also, please ensure that you allow a sufficiently large ratio between allocated virtual and physical memory. Overcommitting virtual memory is fine. Please check the following property in yarn-site.xml: 'yarn.nodemanager.vmem-pmem-ratio'. I usually configure this to something between 2 and 5. If this does not solve your problem, you can also prevent your task processes from being killed in these situations by setting 'yarn.nodemanager.vmem-check-enabled' to false.
>
> Regards,
> Matthias
>
>
> From: Deron Eriksson
> To: dev@systemml.incubator.apache.org
> Date: 11/16/2015 02:58 PM
> Subject: Re: Out of memory error and problem generating heap dump
>
> Hello Matthias,
>
> Thank you for the help! I'm still running into issues, so I was wondering if you have any further guidance. My main question is whether I am setting the memory and garbage collection options in the right place, since it's a multi-node, multi-JVM environment.
>
> With regard to your point (1):
> I updated my mapred-site.xml mapreduce.map.java.opts property to "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and my mapreduce.reduce.java.opts property to "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".
>
> When I reran Kmeans.dml with the 1M-row matrix, I hit the same errors at this point, but the log messages now provided more useful information. The physical memory appeared to be fine, but the virtual memory had an issue:
>
> 15/11/16 10:36:05 INFO mapreduce.Job: Task Id : attempt_1447698794207_0001_m_000015_2, Status : FAILED
> Container [pid=63900,containerID=container_1447698794207_0001_01_000072] is running beyond virtual memory limits. Current usage: 165.6 MB of 1 GB physical memory used; 3.7 GB of 2.1 GB virtual memory used. Killing container.
>
> Next I looked at points (2) and (3):
> I updated mapreduce.map.java.opts to "-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and mapreduce.reduce.java.opts to "-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".
>
> This resulted in the same errors as before.
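>
> For concreteness, here is roughly how those two mapred-site.xml entries read after that change (a sketch of the settings just described, inside the <configuration> element):
>
>   <property>
>     <name>mapreduce.map.java.opts</name>
>     <!-- -server, fixed 2 GB heap, 200 MB young generation, plus the heap-dump options -->
>     <value>-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/</value>
>   </property>
>   <property>
>     <name>mapreduce.reduce.java.opts</name>
>     <value>-server -Xmx2g -Xms2g -Xmn200m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/</value>
>   </property>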
>
> Am I setting the memory and garbage collection options in the right place, for the right JVMs?
>
> Each node has 12 GB RAM and about 60 GB (of 144 GB) free HD space.
>
> Thanks!
> Deron
>
>
> On Fri, Nov 13, 2015 at 4:17 PM, wrote:
>
> > Hi Deron,
> >
> > A couple of things to try out:
> >
> > 1) Task configuration: Please double-check your configuration; if the errors are coming from the individual map/reduce tasks, please change 'mapreduce.map.java.opts' and 'mapreduce.reduce.java.opts' in your mapred-site.xml. The name node / data node configurations don't have any effect on the actual tasks.
> >
> > 2) Recommended memory config: Normally, we recommend a configuration of -Xmx2g -Xms2g -Xmn200m for map/reduce tasks (if this still allows a task per core), with an io.sort.mb of 384 MB for an HDFS block size of 128 MB. Note the -Xmn parameter, which fixes the size of the young generation; this size also affects additional memory overheads. If it is set to 10% of your max heap, we guarantee that your tasks will not run out of memory.
> >
> > 3) GC overhead: You're not getting the OOM because you actually ran out of memory but because you spent too much time on garbage collection (because you are close to the memory limits). If you're running OpenJDK, it's usually a good idea to specify the '-server' flag. If this does not help, you might want to increase the number of threads for garbage collection.
> >
> > 4) Explain w/ memory estimates: Finally, there is always the possibility of bugs too. If the configuration changes above do not solve the problem, please run it with "-explain recompile_hops" and subsequently "-explain recompile_runtime", which will give you the memory estimates. Things to look for are broadcast-based operators where the size of the vectors exceeds the budgets of your tasks, and instructions that generate large outputs.
> >
> >
> > Regards,
> > Matthias
> >
> >
> > From: Deron Eriksson
> > To: dev@systemml.incubator.apache.org
> > Date: 11/13/2015 03:25 PM
> > Subject: Out of memory error and problem generating heap dump
> >
> > Hello,
> >
> > I'm running into an out-of-memory issue when I attempt to use the Kmeans.dml algorithm on a 1M-row matrix of generated test data. I am trying to generate a heap dump in order to help diagnose the problem, but so far I haven't been able to correctly generate a heap dump file. I was wondering if anyone has any advice regarding the out-of-memory issue and creating a heap dump to help diagnose the problem.
> >
> > I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop batch mode. The master node has NameNode, SecondaryNameNode, and ResourceManager daemons running on it. The 3 other nodes have DataNode and NodeManager daemons running on them.
> >
> > I'm trying out the Kmeans.dml algorithm. To begin, I generated test data using the genRandData4Kmeans.dml script with 100K rows via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx
> >
> > Next, I ran Kmeans.dml against the Xsmall.mtx 100K-row matrix via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5
> >
> > This ran perfectly.
> >
> > However, next I increased the amount of test data to 1M rows, which resulted in matrix data of about 3GB in size:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx Y=Y.mtx YbyC=YbyC.mtx
> >
> > I ran Kmeans.dml against the 1M-row X.mtx matrix via:
> >
> > hadoop jar system-ml-0.8.0/SystemML.jar -f system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
> >
> > In my console, I received a number of error messages such as:
> >
> > Error: Java heap space
> > 15/11/13 14:48:58 INFO mapreduce.Job: Task Id : attempt_1447452404596_0006_m_000023_1, Status : FAILED
> > Error: GC overhead limit exceeded
> >
> > Next, I attempted to generate a heap dump. Additionally, I added some settings so that I could look at memory usage remotely using JConsole.
> >
> > I added the following lines to my hadoop-env.sh files on each node:
> >
> > export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"
> >
> > export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/ -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"
> >
> > I added the following to my yarn-env.sh files on each node:
> >
> > export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_RESOURCEMANAGER_OPTS}"
> >
> > export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/ -Dcom.sun.management.jmxremote.port=9998 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"
> >
> > Additionally, I modified the bin/hadoop file:
> >
> > HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps/ -Dcom.sun.management.jmxremote.port=9997 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false"
> >
> > I was able to look at my Java processes remotely in real-time using JConsole. I did not see where the out-of-memory error was happening.
> >
> > Next, I examined the error logs on the 4 nodes. I searched for FATAL entries with the following:
> >
> > $ pwd
> > /home/hadoop2/hadoop-2.6.2/logs
> > $ grep -R FATAL *
> >
> > On the slave nodes, I had log messages such as the following, which seem to indicate the error occurred for the YARN process (NodeManager):
> >
> > userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12 17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> > Does anyone have any advice regarding what is causing this error or how I can go about generating a heap dump so I can help diagnose the issue?
> >
> > Thank you,
> >
> > Deron
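
P.S. For anyone finding this thread in the archives: the yarn-site.xml properties Matthias mentions under point 2 above take the following form. This is only a sketch; the ratio value of 4 is illustrative (somewhere in the 2-5 range he suggests), and disabling the vmem check is the fallback rather than the first resort:

  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>   <!-- illustrative: allow virtual memory up to 4x the physical container size -->
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>   <!-- fallback only: stops the NodeManager from killing containers on vmem overuse -->
  </property>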