systemml-dev mailing list archives

From mbo...@us.ibm.com
Subject Re: Out of memory error and problem generating heap dump
Date Sat, 14 Nov 2015 00:17:43 GMT
Hi Deron,

A couple of things to try out:

1) Task Configuration: Please double-check your configuration; if the errors
are coming from the individual map/reduce tasks, change
'mapreduce.map.java.opts' and 'mapreduce.reduce.java.opts' in your
mapred-site.xml. The name node / data node configurations don't have any
effect on the actual tasks.
2) Recommended mem config: Normally, we recommend a configuration of -Xmx2g
-Xms2g -Xmn200m for map/reduce tasks (if this still allows one task per
core), with an io.sort.mb of 384 MB for an HDFS block size of 128 MB. Note
the -Xmn parameter, which fixes the size of the young generation; this size
also affects additional memory overheads - if set to 10% of your max heap,
we guarantee that your tasks will not run out of memory. (See the example
mapred-site.xml sketch after this list.)
3) GC Overhead: You're not getting the OOM because you actually ran out of
memory but because you spent too much time on garbage collection (because
you are close to the memory limits). If you're running OpenJDK, it's usually
a good idea to specify the '-server' flag. If this does not help, you might
want to increase the number of threads for garbage collection.
4) Explain w/ memory estimates: Finally, there is always a possibility of
bugs too. If the configuration changes above do not solve the problem,
please run it with "-explain recompile_hops" and subsequently "-explain
recompile_runtime", which will give you the memory estimates (see the
example invocation after this list) - things to look for are broadcast-based
operators where the size of the vectors exceeds the budget of your tasks,
and instructions that generate large outputs.
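
For reference, here is a sketch of what (1)-(3) could look like in
mapred-site.xml. Treat it as an illustration under a few assumptions rather
than a verified config: the container sizes and the GC thread count are
example values I picked, and on Hadoop 2.x the sort buffer property is named
'mapreduce.task.io.sort.mb':

<configuration>
  <!-- 2 GB heap, fixed 200 MB young generation, server compiler,
       and (as an example) more parallel GC threads per task JVM -->
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-server -Xmx2g -Xms2g -Xmn200m -XX:ParallelGCThreads=4</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-server -Xmx2g -Xms2g -Xmn200m -XX:ParallelGCThreads=4</value>
  </property>
  <!-- 384 MB sort buffer for a 128 MB HDFS block size -->
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>384</value>
  </property>
  <!-- container sizes must exceed the task heap to leave room for
       non-heap overheads; 2560 MB for a 2 GB heap is an example value -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2560</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>2560</value>
  </property>
</configuration>

Note that 'mapreduce.map.java.opts' is also where task-level heap dump flags
would go (e.g., -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=...); the
daemon-level settings in your message below only cover the long-running
daemons, not the task JVMs where the OOM actually occurs.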
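
For (4), note that '-explain' has to come before '-nvargs', since everything
after '-nvargs' is parsed as script arguments; for example, for your 1M-row
run:

hadoop jar system-ml-0.8.0/SystemML.jar -f
system-ml-0.8.0/algorithms/Kmeans.dml -explain recompile_hops -nvargs
X=X.mtx k=5

The hop-level output should annotate each operator with its memory estimate,
which you can compare against the 2 GB task budget from (2).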


Regards,
Matthias





From:	Deron Eriksson <deroneriksson@gmail.com>
To:	dev@systemml.incubator.apache.org
Date:	11/13/2015 03:25 PM
Subject:	Out of memory error and problem generating heap dump



Hello,

I'm running into an out-of-memory issue when I attempt to run the
Kmeans.dml algorithm on a 1M-row matrix of generated test data. To help
diagnose the problem, I have been trying to generate a heap dump, but so far
I haven't been able to produce a valid heap dump file. I was wondering if
anyone has advice regarding the out-of-memory issue and regarding generating
a heap dump.

I set up a 4-node Hadoop cluster (on Red Hat Enterprise Linux Server
release 6.6 (Santiago)) with HDFS and YARN to try out SystemML in Hadoop
batch mode. The master node has NameNode, SecondaryNameNode, and
ResourceManager daemons running on it. The 3 other nodes have DataNode and
NodeManager daemons running on them.

I'm trying out the Kmeans.dml algorithm. To begin, I generated test data
using the genRandData4Kmeans.dml script with 100K rows via:

hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
nr=100000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=Xsmall.mtx
C=Csmall.mtx Y=Ysmall.mtx YbyC=YbyCsmall.mtx

Next, I ran Kmeans.dml against the Xsmall.mtx 100K-row matrix via:

hadoop jar system-ml-0.8.0/SystemML.jar -f
system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=Xsmall.mtx k=5

This ran perfectly.

Next, I increased the amount of test data to 1M rows, which resulted in
matrix data of about 3 GB in size:

hadoop jar system-ml-0.8.0/SystemML.jar -f genRandData4Kmeans.dml -nvargs
nr=1000000 nf=100 nc=10 dc=10.0 dr=1.0 fbf=100.0 cbf=100.0 X=X.mtx C=C.mtx
Y=Y.mtx YbyC=YbyC.mtx

I ran Kmeans.dml against the 1M-row X.mtx matrix via:

hadoop jar system-ml-0.8.0/SystemML.jar -f
system-ml-0.8.0/algorithms/Kmeans.dml -nvargs X=X.mtx k=5

In my console, I received a number of error messages such as:

Error: Java heap space
15/11/13 14:48:58 INFO mapreduce.Job: Task Id :
attempt_1447452404596_0006_m_000023_1, Status : FAILED
Error: GC overhead limit exceeded

Next, I attempted to generate a heap dump. Additionally, I added some
settings so that I could look at memory usage remotely using JConsole.

I added the following lines to my hadoop-env.sh files on each node:

export HADOOP_NAMENODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${HADOOP_NAMENODE_OPTS}"

export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-dfs/
-Dcom.sun.management.jmxremote.port=9999
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${HADOOP_DATANODE_OPTS}"

I added the following to my yarn-env.sh files on each node:

export YARN_RESOURCEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
-Dcom.sun.management.jmxremote.port=9998
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
${YARN_RESOURCEMANAGER_OPTS}"

export YARN_NODEMANAGER_OPTS="-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps-yarn/
-Dcom.sun.management.jmxremote.port=9998
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false ${YARN_NODEMANAGER_OPTS}"

Additionally, I modified the bin/hadoop file:

HADOOP_OPTS="$HADOOP_OPTS -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/home/hadoop2/heapdumps/
-Dcom.sun.management.jmxremote.port=9997
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false"

I was able to look at my Java processes remotely in real-time using
JConsole. I did not see where the out-of-memory error was happening.

Next, I examined the error logs on the 4-nodes. I searched for FATAL
entries with the following:

$ pwd
/home/hadoop2/hadoop-2.6.2/logs
$ grep -R FATAL *

On the slave nodes, I had log messages such as the following, which seem to
indicate the error occurred in a YARN-managed task container
(org.apache.hadoop.mapred.YarnChild):

userlogs/application_1447377156841_0006/container_1447377156841_0006_01_000007/syslog:2015-11-12 17:53:22,581 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: GC overhead limit exceeded

Does anyone have any advice regarding what is causing this error or how I
can go about generating a heap dump so I can help diagnose the issue?

Thank you,

Deron


