nutch-user mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Problems with HeapSpace in Hadoop Cluster
Date Tue, 06 Dec 2011 09:17:35 GMT
Do your unmerged segments contain a lot of duplicates? If not, then don't
merge. It is not required anymore and takes a lot of time. Technically there
is no reason to merge segments, and the WebGraph program already has a
-segmentDir option in 1.4.
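
For reference, a rough sketch of running it straight against the unmerged
segments (everything here apart from the -segmentDir option itself is my
assumption about the 1.4 command line, so check it against bin/nutch's
usage output):

  bin/nutch webgraph -webgraphdb crawl/webgraphdb -segmentDir crawl/segments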

Otherwise, increase mapper heap space and decrease reducer heap space.
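
As a rough sketch only (these property names are from mapred-default.xml in
the 0.20.x line as far as I recall, and the values are just my guess for
1.5 GB slaves, not something from this thread), one way to relieve the
reduce-side shuffle in mapred-site.xml is to give the child JVMs a bit more
heap and shrink the shuffle's in-memory buffer:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx768m</value>
    <description>
      Heap for each child JVM; raise it only as far as the 1.5 GB slaves
      can hold given the number of concurrent tasks per node.
    </description>
  </property>

  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.50</value>
    <description>
      Fraction of the reduce task heap used to buffer map outputs during
      the copy phase (default 0.70); lowering it makes the OutOfMemoryError
      in shuffleInMemory less likely.
    </description>
  </property>

If your Hadoop version already ships the separate mapred.map.child.java.opts
and mapred.reduce.child.java.opts properties, those would let you size the
mapper and reducer heaps independently as suggested above.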

> Hello folks,
> 
> Excuse me if my topic is half off-topic, since it mostly concerns the
> Hadoop setup, but perhaps someone has had the same problem already.
> 
> The situation is as follows:
> 
> I am merging the segments together after every crawl cycle so that the
> WebGraph can be calculated easily.
> 
> At the moment I have three segments, and two of them are very large (but
> I think not by Hadoop standards ;-) ).
> 
> The first seg is 22.9 GB, the second seg is 23.3 GB, and the third seg
> is (only) 528 MB.
> 
> When I try to merge these three segments together, the job crashes after
> a while with a couple of heap space errors; see below:
> 
> 11/12/06 00:20:53 INFO mapred.JobClient: Task Id : attempt_201112052355_0002_r_000003_0, Status : FAILED
> Error: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
> 
> My Hadoop cluster consists of a master node with 4 GB RAM and a dual-core
> CPU running at 2 GHz per core.
> There are five identical slaves with 1.5 GB RAM and dual-core CPUs running
> at 3.0 GHz per core.
> 
> I set HADOOP_HEAPSIZE to 1500 MB, and the following in mapred-site.xml:
> 
>   <property>
>     <name>mapred.child.java.opts</name>
>     <value>-Xmx512m</value>
>     <description>
>       You can specify other Java options for each map or reduce task here,
>       but most likely you will want to adjust the heap size.
>     </description>
>   </property>
> 
>   <property>
>     <name>mapred.map.tasks</name>
>     <value>50</value>
>     <description>
>       define mapred.map tasks to be number of slave hosts
>     </description>
>   </property>
> 
>   <property>
>     <name>mapred.reduce.tasks</name>
>     <value>6</value>
>     <description>
>       define mapred.reduce tasks to be number of slave hosts
>     </description>
>   </property>
> 
> But with these values I can't run the merger without the job failing
> because of the heap space errors. Any idea whether I could solve the
> problem by adjusting the configuration, or do I just need more RAM for
> this job?
> 
> Thank you very much!
> 
> Marek
