hadoop-mapreduce-user mailing list archives

From Mahmood Naderan <nt_mahm...@yahoo.com>
Subject Re: Solving "heap size error"
Date Thu, 13 Mar 2014 11:01:20 GMT
The strange thing is that whether I use -Xmx128m or -Xmx16384m, the process stops at chunk
#571 (571*64MB = 36.5GB).
I still haven't figured out whether this is a problem with the JVM, Hadoop, or Mahout.

I have tested various parameters on a machine with 16GB of RAM:


<property>
  <name>mapred.map.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
<property>
  <name>mapred.reduce.child.java.opts</name>
  <value>-Xmx4096m</value>
</property>
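As a rough sanity check on values like these: each concurrent task JVM gets its own heap, so the per-task settings multiply by the number of task slots. A back-of-envelope budget (the slot counts below are hypothetical examples, not anyone's actual configuration; the real counts come from mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum) might look like:

```shell
# Hypothetical example: 4 concurrent map tasks at -Xmx2048m plus 2 reduce
# tasks at -Xmx4096m can claim 4*2048 + 2*4096 = 16384 MB of heap at once,
# i.e. roughly all of a 16GB machine before the OS, the DataNode and the
# TaskTracker daemons get anything.
echo $(( 4 * 2048 + 2 * 4096 ))   # total concurrent task heap, in MB
```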

Is there a relation between these parameters and the amount of available memory?
I also see a HADOOP_HEAPSIZE setting in hadoop-env.sh, which is commented out by default. What is it for?
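For reference — this is my understanding of the stock hadoop-env.sh, worth double-checking against your Hadoop version — HADOOP_HEAPSIZE sets the heap, in MB, for the Hadoop daemons themselves (NameNode, DataNode, JobTracker, TaskTracker), not for the per-task child JVMs, which are governed by the *.child.java.opts properties above. Uncommented, the line looks like:

```shell
# hadoop-env.sh: maximum heap for the Hadoop daemon JVMs, in MB
# (when the line stays commented out, the default of 1000 MB applies).
export HADOOP_HEAPSIZE=2000
```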
 
Regards,
Mahmood



On Tuesday, March 11, 2014 11:57 PM, Mahmood Naderan <nt_mahmood@yahoo.com> wrote:
 
As I posted earlier, here is the result of a successful test

5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml) with 4GB of RAM and
-Xmx128m took 5 minutes to complete.

I didn't find a larger Wikipedia XML file. I still need to test 10GB, 20GB and 30GB files.


 
Regards,
Mahmood




On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:

Can you please try running this on a smaller file first, per Suneel's
comment a while back:

"Please first try running this on a smaller dataset like
'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
english wikipedia."



On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <nt_mahmood@yahoo.com> wrote:

> Hi,
> I recently ran into a heap size error when I run
>
>   $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d
>
> $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o
> wikipedia/chunks -c 64
>
> Here are the specs:
> 1- XML file size = 44GB
> 2- System memory = 54GB (on virtualbox)
> 3- Heap size = 51GB (-Xmx51000m)
>
> At the time of failure, I see that 571 chunks have been created (hadoop dfs -ls),
> so 36GB of the original file has been processed. Now here are my questions:
>
> 1- Is there any way to resume the process? As stated before, 571 chunks
> have been created. So by resuming, it can create the rest of the chunks
> (572~).
>
> 2- Is it possible to parallelize the process? Assume 100GB of heap is
> required to process the XML file and my system cannot afford that. Then we
> could create 20 threads, each requiring 5GB of heap. By feeding the first
> 10 threads we use the available 50GB of heap, and after they complete, we
> feed the next set of threads.
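The batching idea in question 2 — 20 workers, but at most 10 in flight so total heap stays within 50GB — can be sketched generically with xargs -P. Note that process_slice.sh here is a hypothetical per-slice worker: wikipediaXMLSplitter runs as a single JVM, so the input would have to be pre-split before anything like this applies. To keep the sketch self-contained, each job below is simulated with echo:

```shell
# Run 20 slice jobs with at most 10 concurrent processes; if each worker
# JVM were capped at -Xmx5g, total heap would stay under 10 x 5GB = 50GB.
# Each job is simulated with echo (replace with ./process_slice.sh {} in
# a real setup); wc -l counts the 20 completed slices.
seq 0 19 | xargs -P 10 -I{} sh -c 'echo "slice {} processed"' | wc -l
```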
>
>
> Regards,
> Mahmood