hadoop-mapreduce-user mailing list archives

From Mahmood Naderan <nt_mahm...@yahoo.com>
Subject Re: Solving "heap size error"
Date Tue, 11 Mar 2014 20:26:53 GMT
As I posted earlier, here is the result of a successful test:

A 5.4GB XML file (which is larger than enwiki-latest-pages-articles10.xml) with 4GB of RAM and
-Xmx128m took 5 minutes to complete.

I didn't find a larger Wikipedia XML file. I still need to test 10GB, 20GB, and 30GB files.
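As a quick sanity check on the chunk arithmetic reported further down in this thread (571 chunks of 64MB before the failure on the 44GB file), a few lines of Python confirm the "36GB processed" estimate and project the expected total chunk count. This is just arithmetic on the numbers quoted below, nothing Mahout-specific:

```python
# Sanity check of the chunk numbers reported in the thread:
# 571 chunks of 64 MB each should correspond to roughly 36 GB processed.
CHUNK_MB = 64
chunks_done = 571
processed_gb = chunks_done * CHUNK_MB / 1024  # MB -> GB
print(round(processed_gb, 1))  # roughly 35.7, i.e. ~36 GB of the 44 GB file

# Projected total number of 64 MB chunks for the full 44 GB file:
total_gb = 44
total_chunks = total_gb * 1024 // CHUNK_MB
print(total_chunks)  # 704 chunks expected in total
```

So a resumed run would still have roughly 133 chunks (572-704) left to write.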


On Tuesday, March 11, 2014 11:41 PM, Andrew Musselman <andrew.musselman@gmail.com> wrote:
Can you please try running this on a smaller file first, per Suneel's
comment a while back:

"Please first try running this on a smaller dataset like
'enwiki-latest-pages-articles10.xml' as opposed to running on the entire
english wikipedia."

On Tue, Mar 11, 2014 at 12:56 PM, Mahmood Naderan <nt_mahmood@yahoo.com> wrote:

> Hi,
> Recently I have faced a heap size error when I run
>   $MAHOUT_HOME/bin/mahout wikipediaXMLSplitter -d
> $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml -o
> wikipedia/chunks -c 64
> Here are the specs:
> 1- XML file size = 44GB
> 2- System memory = 54GB (on virtualbox)
> 3- Heap size = 51GB (-Xmx51000m)
> At the time of failure, I see that 571 chunks have been created (hadoop dfs -ls),
> so roughly 36GB of the original file has been processed. Now here are my questions:
> 1- Is there any way to resume the process? As stated before, 571 chunks
> have been created, so by resuming it could create the rest of the chunks
> (572 onward).
> 2- Is it possible to parallelize the process? Assume 100GB of heap is
> required to process the XML file and my system cannot afford that. Then we
> could create 20 threads, each requiring 5GB of heap. By feeding the first
> 10 threads we use the available 50GB of heap, and after they complete we
> feed the next set of threads.
> Regards,
> Mahmood
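The batching idea in question 2 can be sketched in a few lines. To be clear, this is only an illustration of the scheduling pattern: as far as I can tell, wikipediaXMLSplitter itself has no resume or multi-threaded mode, and `process_range` below is a hypothetical placeholder for "split this byte range of the XML", not a real Mahout call. A bounded thread pool gives exactly the behavior described: 20 tasks, at most 10 in flight, so peak memory stays within what the machine can afford:

```python
# Illustrative sketch of question 2's batching idea, not a Mahout feature:
# 20 tasks of 5 "GB" each, run at most 10 at a time.
from concurrent.futures import ThreadPoolExecutor

def process_range(start_gb, end_gb):
    # Hypothetical placeholder for "split this byte range into chunks".
    return f"processed {start_gb}-{end_gb} GB"

ranges = [(i * 5, (i + 1) * 5) for i in range(20)]  # 20 tasks covering 100 GB

results = []
# max_workers=10 caps concurrency: the second batch of tasks only starts
# as tasks from the first batch finish, bounding peak heap use to ~50 GB
# in the scenario described above.
with ThreadPoolExecutor(max_workers=10) as pool:
    for r in pool.map(lambda rng: process_range(*rng), ranges):
        results.append(r)

print(len(results))  # all 20 ranges processed
```

The harder part, left out here, is that splitting an XML stream by byte range requires each worker to find a clean element boundary before it can start parsing.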