systemml-dev mailing list archives

From Deron Eriksson <deroneriks...@gmail.com>
Subject Re: Out of memory error and problem generating heap dump
Date Mon, 16 Nov 2015 22:58:03 GMT
Hello Matthias,

Thank you for the help! I'm still running into issues, so I was wondering if
you have any further guidance. My main question is whether I am setting the
memory and garbage collection options in the right place, since this is a
multi-node, multi-JVM environment.

With regards to your point (1):
I updated my mapred-site.xml mapreduce.map.java.opts property to
"-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and my
mapreduce.reduce.java.opts property to "-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".
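
For reference, in mapred-site.xml form these look roughly like the following
(showing only the heap-dump flags; anything else already in the opts values is
kept as-is):

<property>
  <name>mapreduce.map.java.opts</name>
  <value>-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-map/</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/</value>
</property>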

When I reran Kmeans.dml with the 1M-row matrix, I hit the same errors as
before, but the log messages now provided more useful information. Physical
memory usage looked fine, but the virtual memory limit was being exceeded:
15/11/16 10:36:05 INFO mapreduce.Job: Task Id :
attempt_1447698794207_0001_m_000015_2, Status : FAILED
Container [pid=63900,containerID=container_1447698794207_0001_01_000072] is
running beyond virtual memory limits. Current usage: 165.6 MB of 1 GB
physical memory used; 3.7 GB of 2.1 GB virtual memory used. Killing
container.
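
If I am reading this right, the 2.1 GB limit comes from YARN's
virtual-to-physical memory ratio (the default yarn.nodemanager.vmem-pmem-ratio
is 2.1, which matches the 1 GB container here). As a minimal sketch, the
yarn-site.xml knobs I believe control this are the following (values are purely
illustrative, not what I currently have set):

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value> <!-- default is 2.1; raising it relaxes the virtual memory limit -->
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value> <!-- disables the virtual memory check entirely -->
</property>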


Next I looked at points (2) and (3):
I updated mapreduce.map.java.opts to "-server -Xmx2g -Xms2g -Xmn200m
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-map/" and
mapreduce.reduce.java.opts to "-server -Xmx2g -Xms2g -Xmn200m
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-reduce/".

This resulted in the same errors as before.
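
One thing I am unsure about is whether the container sizes also need to be
raised to match the 2g heaps, since the failed containers above were capped at
1 GB of physical memory. My understanding (property names assumed from the
Hadoop 2.x defaults; values illustrative) is that this is controlled in
mapred-site.xml by:

<property>
  <name>mapreduce.map.memory.mb</name>
  <value>3072</value> <!-- container size; should exceed -Xmx plus non-heap overhead -->
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>3072</value>
</property>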

Am I setting the memory and garbage collection options in the right place and
for the right JVMs?

Each node has 12GB RAM and about 60GB (of 144GB) free HD space.
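
(For completeness: I assume the total memory YARN can hand out to containers on
each node is capped by yarn.nodemanager.resource.memory-mb in yarn-site.xml.
With 12GB of RAM per node, something on the order of 8GB for containers would
leave room for the OS and the DataNode/NodeManager daemons. The value below is
illustrative only, not what I have configured.)

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value> <!-- total MB YARN may allocate to containers on this node -->
</property>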

Thanks!
Deron




On Fri, Nov 13, 2015 at 4:17 PM, <mboehm@us.ibm.com> wrote:

> Hi Deron,
>
> couple of things to try out:
>
> 1) Task Configuration: please double check your configuration; if the errors
> are coming from the individual map/reduce tasks, please change
> 'mapreduce.map.java.opts' and 'mapreduce.reduce.java.opts' in your
> mapred-site.xml. The name node / data node configurations don't have any
> effect on the actual tasks.
> 2) Recommended mem config: Normally, we recommend a configuration of -Xmx2g
> -Xms2g -Xmn200m for map/reduce tasks (if this still allows a task per core) w/
> an io.sort.mb of 384 MB for an hdfs blocksize of 128MB (see the sketch after
> this list). Note the -Xmn parameter, which fixes the size of the young
> generation; this size also affects additional memory overheads - if set to 10%
> of your max heap, we guarantee that your tasks will not run out of memory.
> 3) GC Overhead: You're not getting the OOM because you actually ran out of
> memory but because you spent too much time on garbage collection (because
> you are close to the mem limits). If you're running OpenJDK, it's usually a
> good idea to specify the '-server' flag. If this does not help, you might
> want to increase the number of threads for garbage collection.
> 4) Explain w/ memory estimates: Finally, there is always a possibility of
> bugs too. If the configuration changes above do not solve the problem,
> please run it with "-explain recompile_hops" and subsequently "-explain
> recompile_runtime" which will give you the memory estimates - things to
> look for are broadcast-based operators where the size of vectors exceeds the
> budgets of your tasks and instructions that generate large outputs.
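>
> As a rough mapred-site.xml sketch of (2) and (3) (property names assumed from
> Hadoop 2.x; adjust values to your cluster, and keep any flags you already set,
> e.g. the heap-dump options):
>
> <property>
>   <name>mapreduce.map.java.opts</name>
>   <value>-server -Xmx2g -Xms2g -Xmn200m</value>
> </property>
> <property>
>   <name>mapreduce.reduce.java.opts</name>
>   <value>-server -Xmx2g -Xms2g -Xmn200m</value>
> </property>
> <property>
>   <name>mapreduce.task.io.sort.mb</name> <!-- older configs call this io.sort.mb -->
>   <value>384</value>
> </property>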
>
>
> Regards,
> Matthias
>
>
>
>
>
> From:   Deron Eriksson <deroneriksson@gmail.com>
> To:     dev@systemml.incubator.apache.org
> Date:   11/13/2015 03:25 PM
> Subject:        Out of memory error and problem generating heap dump
>
>
>
> Hello,
>
> I'm running into an out-of-memory issue when I attempt to use the
> Kmeans.dml algorithm on a 1M-row matrix of generated test data. I am trying
> to generate a heap dump in order to help diagnose the problem but so far I
> haven't been able to correctly generate a heap dump file. I was wondering
> if anyone has any advice regarding the out-of-memory issue and creating a
> heap dump to help diagnose the problem.
>
> I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server
> release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop
> batch mode. The master node has NameNode, SecondaryNameNode, and
> ResourceManager daemons running on it. The 3 other nodes have DataNode and
> NodeManager daemons running on them.
>
> I'm trying out the Kmeans.dml algorithm. To begin, I generated test data
> using the genRandData4Kmeans.dml script with 100K rows via:
>
> hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
> nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx
> C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx
>
> Next, I ran Kmeans.dml against the Xsmall.mtx 100K-row matrix via:
>
> hadoop jar system-ml-0.8.0/SystemML.jar -f
> system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5
>
> This ran perfectly.
>
> However, next I increased the amount of test data to 1M rows, which
> resulted in matrix data of about 3GB in size:
>
> hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
> nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx
> Y=Y.mtx YbyC=YbyC.mtx
>
> I ran Kmeans.dml against the 1M-row X.mtx matrix via:
>
> hadoop jar system-ml-0.8.0/SystemML.jar -f
> system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5
>
> In my console, I received a number of error messages such as:
>
> Error: Java heap space
> 15/11/13 14:48:58 INFO mapreduce.Job: Task Id :
> attempt_1447452404596_0006_m_000023_1, Status : FAILED
> Error: GC overhead limit exceeded
>
> Next, I attempted to generate a heap dump. Additionally, I added some
> settings so that I could look at memory usage remotely using JConsole.
>
> I added the following lines to my hadoop-env.sh files on each node:
>
> export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
> -Dcom.sun.management.jmxremote.port=9999
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"
>
> export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
> -Dcom.sun.management.jmxremote.port=9999
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"
>
> I added the following to my yarn-env.sh files on each node:
>
> export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
> -Dcom.sun.management.jmxremote.port=9998
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.local.only=false
> ${YARN_RESOURCEMANAGER_OPTS}"
>
> export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
> -Dcom.sun.management.jmxremote.port=9998
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"
>
> Additionally, I modified the bin/hadoop file:
>
> HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError
> -XX:HeapDumpPath=/home/hadoop2/heapdumps/
> -Dcom.sun.management.jmxremote.port=9997
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.local.only=false"
>
> I was able to look at my Java processes remotely in real-time using
> JConsole. I did not see where the out-of-memory error was happening.
>
> Next, I examined the error logs on the 4 nodes. I searched for FATAL
> entries with the following:
>
> $ pwd
> /home/hadoop2/hadoop-2.6.2/logs
> $ grep -R FATAL *
>
> On the slave nodes, I had log messages such as the following, which seem to
> indicate the error occurred for the YARN process (NodeManager).
>
>
> userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12
> 17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running
> child : java.lang.OutOfMemoryError: GC overhead limit exceeded
>
> Does anyone have any advice regarding what is causing this error or how I
> can go about generating a heap dump so I can help diagnose the issue?
>
> Thank you,
>
> Deron
>
>
>
