nutch-user mailing list archives

From Marek Bachmann <m.bachm...@uni-kassel.de>
Subject Problems with HeapSpace in Hadoop Cluster
Date Mon, 05 Dec 2011 23:37:26 GMT
Hello folks,

Excuse me if this is half off-topic, since it mostly concerns the
Hadoop setup, but perhaps someone has already had the same problem.

The situation is as follows:

I merge the segments after every crawl cycle so that the WebGraph is
easier to calculate.

At the moment I have three segments, and two of them are very large
(though I think not by Hadoop standards ;-) ).

The first segment is 22.9 GB, the second is 23.3 GB, and the third
is (only) 528 MB.

When I try to merge these three segments, the job crashes after a
while, following a couple of heap space errors (see below):

11/12/06 00:20:53 INFO mapred.JobClient: Task Id : attempt_201112052355_0002_r_000003_0, Status : FAILED
Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

My Hadoop cluster consists of a master node with 4 GB RAM and a
dual-core CPU with 2 GHz per core.
There are five identical slaves with 1.5 GB RAM and dual-core CPUs with
3.0 GHz per core.

I set HADOOP_HEAPSIZE to 1500 MB, and mapred-site.xml contains:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <description>
      You can specify other Java options for each map or reduce task here,
      but most likely you will want to adjust the heap size.
    </description>
  </property>

  <property>
    <name>mapred.map.tasks</name>
    <value>50</value>
    <description>
      define mapred.map tasks to be number of slave hosts
    </description>
  </property>

  <property>
    <name>mapred.reduce.tasks</name>
    <value>6</value>
    <description>
      define mapred.reduce tasks to be number of slave hosts
    </description>
  </property>
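
As a rough check on the reducer side, where the stack trace points
(shuffleInMemory): a minimal sketch of how much of the 512 MB child
heap the in-memory shuffle may claim. The two fractions are assumptions,
not values from my configuration: I am assuming the stock default of
mapred.job.shuffle.input.buffer.percent (0.70) and a cap of 25% of that
buffer for a single map output. The function name is made up for
illustration, and the real JVM reports slightly less than the -Xmx
value, so the numbers are approximate.

```python
# Rough estimate of the reducer's in-memory shuffle budget.
# Assumed (not taken from the posted config): buffer_percent = 0.70 and
# single_segment_fraction = 0.25, i.e. stock defaults.

def shuffle_budget_mb(child_heap_mb, buffer_percent=0.70,
                      single_segment_fraction=0.25):
    """Return (shuffle buffer size, largest map output kept in memory) in MB."""
    buffer_mb = child_heap_mb * buffer_percent
    max_segment_mb = buffer_mb * single_segment_fraction
    return buffer_mb, max_segment_mb

# -Xmx512m is the value from mapred.child.java.opts above.
buffer_mb, max_segment_mb = shuffle_budget_mb(512)
print(f"shuffle buffer: about {buffer_mb:.1f} MB")
print(f"largest single map output held in memory: about {max_segment_mb:.1f} MB")
```

If those assumptions hold, most of the reducer heap can be taken up by
fetched map outputs, which leaves little room for everything else the
task allocates.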

But with these settings I can't run the merger without the job failing
because of the heap space errors. Any idea whether I could solve the
problem by adjusting the configuration, or do I just need more RAM for
this job?
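
To make the RAM question concrete, here is a back-of-the-envelope
sketch of the worst-case heap demand on one slave. The RAM size,
HADOOP_HEAPSIZE and child heap are the values given above. Two things
are assumptions I have not verified on the cluster: that each slave
runs two daemons (DataNode and TaskTracker), and that the task slot
limits are the stock defaults of 2 map and 2 reduce tasks per
TaskTracker. Maximum heap is an upper bound, not actual usage.

```python
# Worst-case heap demand on one slave, in MB.
# From the post: 1.5 GB RAM, HADOOP_HEAPSIZE=1500, child heap -Xmx512m.
# Assumed: 2 daemons per slave, 2 map slots and 2 reduce slots (defaults).

SLAVE_RAM_MB = 1536          # 1.5 GB
DAEMON_MAX_HEAP_MB = 1500    # HADOOP_HEAPSIZE
CHILD_HEAP_MB = 512          # mapred.child.java.opts

def worst_case_heap_mb(daemons=2, map_slots=2, reduce_slots=2):
    """Return (max daemon heap, max task heap) for one slave in MB."""
    daemon_heap = daemons * DAEMON_MAX_HEAP_MB
    task_heap = (map_slots + reduce_slots) * CHILD_HEAP_MB
    return daemon_heap, task_heap

daemon_heap, task_heap = worst_case_heap_mb()
print(f"physical RAM:       {SLAVE_RAM_MB} MB")
print(f"daemon heap limit:  {daemon_heap} MB")
print(f"task heap limit:    {task_heap} MB")
print(f"tasks alone exceed RAM: {task_heap > SLAVE_RAM_MB}")
```

Under these assumptions the task heaps alone could exceed the physical
memory of a slave, so the slot counts and heap sizes may matter as much
as the raw amount of RAM.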

Thank you very much!

Marek
