mahout-user mailing list archives

From Justin Kay ...@easyesi.com>
Subject Re: OutOfMemoryError: Java Heap Space in DocumentProcessor.tokenizeDocuments
Date Mon, 24 Feb 2014 22:26:29 GMT
Thanks, that seemed to help with passing in the parameters but I'm still
running into the same problem with the job. It's getting stuck on Map 0%
Reduce 0% when tokenizing documents (DocumentProcessor.tokenizeDocuments)
and then throws a "java.lang.OutOfMemoryError: GC overhead limit exceeded"
caused by running out of heap space. (I've tried running it with
the -XX:-UseGCOverheadLimit option and it just gives me the same Java heap
error.)
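
For reference, this is roughly how I'm building the argument array now (a
sketch only; the memory values and the input/output paths below are
placeholders, not my exact settings, and -i/-o are the usual seq2sparse
input/output options):

String[] args = new String[] {
    // Hadoop -D options first, so ToolRunner's GenericOptionsParser applies them
    "-Dmapreduce.map.memory.mb=1536",
    "-Dmapreduce.map.java.opts=-Xmx1024m",   // also tried adding -XX:-UseGCOverheadLimit here
    "-Dmapreduce.reduce.memory.mb=1536",
    "-Dmapreduce.reduce.java.opts=-Xmx1024m",
    // then the tool's own options
    "-i", "/path/to/sequence-files",
    "-o", "/path/to/vectors"
};
ToolRunner.run(new SparseVectorsFromSequenceFiles(), args);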

I've also tried running it with Hadoop 1.2.1 and Mahout 0.8 and had the
same problem.


On Sat, Feb 22, 2014 at 12:22 PM, Johannes Schulte <
johannes.schulte@gmail.com> wrote:

> I would pass the memory parameters in the args array directly. The
> Hadoop-specific arguments must come before your custom arguments, like this:
>
> String[] args = new String[]{"-Dmapreduce.map.memory.mb=12323", "customOpt1"};
> ToolRunner.run(VectorizeJob, args);
>
> The ToolRunner takes care of putting the Hadoop-specific arguments into the
> job's configuration. I bet the configuration you use is overridden or replaced
> by something else.
>
> Other than that, there is also
>
> job.getConfiguration().set("mapred.map.child.java.opts", "-Xmx2G");
>
>
> which works for me, but this is dependent on the Hadoop version, I guess.
>
>
>
>
> On Thu, Feb 20, 2014 at 9:15 PM, Justin Kay <jk@easyesi.com> wrote:
>
> > Hi everyone,
> >
> > I've been stuck on an OutOfMemoryError when attempting to run a
> > SparseVectorsFromSequenceFiles() Job in Java. I'm using Mahout 0.9 and
> > Hadoop 2.2, run in a Maven project. I've tried setting the heap
> > configurations through Java using a Hadoop Configuration that is passed
> > to the Job:
> >
> > CONF.set("mapreduce.map.memory.mb", "1536");
> > CONF.set("mapreduce.map.java.opts", "-Xmx1024m");
> > CONF.set("mapreduce.reduce.memory.mb", "1536");
> > CONF.set("mapreduce.reduce.java.opts", "-Xmx1024m");
> > CONF.set("task.io.sort.mb", "512");
> > CONF.set("task.io.sort.factor", "100");
> >
> > etc., but nothing has seemed to work. My Java heap settings are similar and
> > configured to "-Xms512m -Xmx1536m" when running the project. The data I'm
> > using is 100,000 sequence files totaling ~250mb. It doesn't fail on a data
> > set of 63 sequence files totaling ~2mb. Here is an example stack trace:
> >
> > Exception in thread "Thread-18" java.lang.OutOfMemoryError: Java heap space
> > at sun.util.resources.TimeZoneNames.getContents(TimeZoneNames.java:205)
> > at sun.util.resources.OpenListResourceBundle.loadLookup(OpenListResourceBundle.java:125)
> > at sun.util.resources.OpenListResourceBundle.loadLookupTablesIfNecessary(OpenListResourceBundle.java:113)
> > (this seems to get thrown on different bits of code every time)
> > ......
> > java.lang.IllegalStateException: Job failed!
> > at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
> > at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> >
> > This is the code I'm using to run it in order to pass in my own
> > Configuration, where args is a String[] of command-line options:
> >
> > SparseVectorsFromSequenceFiles VectorizeJob = new SparseVectorsFromSequenceFiles();
> > VectorizeJob.setConf(CONF);
> > ToolRunner.run(VectorizeJob, args);
> >
> > Any suggestions would be greatly appreciated.
> >
> > Justin Kay
> >
>
