mahout-user mailing list archives

From Jessie Wright
Subject wikipedia bayes quickstart example on EC2 (cloudera)
Date Sat, 01 Mar 2014 17:11:33 GMT

I'm a noob trying to run the Wikipedia Bayes example on EC2 (using a
cdh4.5 setup).  I've searched the archives and haven't been able to find
anything on this.  I apologize if this is a duplicate question.

The Cloudera install comes with Mahout 0.7.

I've run into a few snags on the first step (chunking the data into
pieces).  The first was that mahout couldn't find wikipediaXMLSplitter.
Substituting the fully qualified class name,
org.apache.mahout.text.wikipedia.WikipediaXmlSplitter, in the command got
me past that error.  (Just changing the capitalization wasn't enough.)
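
A sketch of the substituted command, assuming the -d (dump file), -o
(output directory) and -c (chunk size in MB) flags that the quickstart uses.
The paths are placeholders, not the ones actually used in this run:

```shell
# Placeholder paths (assumptions for illustration only).
DUMP="examples/temp/enwiki-latest-pages-articles.xml"
CHUNKS="wikipedia/chunks"

# Fully qualified class name in place of the wikipediaXMLSplitter short name.
SPLITTER="org.apache.mahout.text.wikipedia.WikipediaXmlSplitter"

# Build the command as a string so it can be inspected before running it.
split_cmd="mahout $SPLITTER -d $DUMP -o $CHUNKS -c 64"
echo "$split_cmd"

# To actually run the splitter, uncomment the next line:
# $split_cmd
```

Printing the command first makes it easy to confirm the class name and
paths before starting a long-running job.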

However, I am now stuck.  I'm getting a java.lang.OutOfMemoryError: Java
heap space error.
I raised MAHOUT_HEAPSIZE to 5000 and am still getting the same error.
See the full error here:  (I added a print
statement to the script in mahout/bin just to confirm that my export of
MAHOUT_HEAPSIZE was being picked up.)

I'm wondering whether some other setting is overriding
MAHOUT_HEAPSIZE, perhaps one of the Hadoop- or Cloudera-specific ones?
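
One possibility worth checking, stated as an assumption rather than a
confirmed cause: when the mahout script hands the job to the hadoop
command instead of starting its own JVM, the heap may be governed by
Hadoop's HADOOP_HEAPSIZE and HADOOP_CLIENT_OPTS rather than by
MAHOUT_HEAPSIZE.  A minimal sketch that prints the candidates side by side
(heap_flag is a hypothetical helper, and the -Xmx<N>m form is an assumption
about how the value in MB is turned into a JVM flag):

```shell
# Hypothetical helper: show the JVM flag a heap size in MB would produce.
heap_flag() {
  if [ -n "$1" ]; then
    echo "-Xmx${1}m"
  else
    echo "(unset)"
  fi
}

# Report each setting that could influence the client JVM heap.
echo "MAHOUT_HEAPSIZE    -> $(heap_flag "${MAHOUT_HEAPSIZE:-}")"
echo "HADOOP_HEAPSIZE    -> $(heap_flag "${HADOOP_HEAPSIZE:-}")"
echo "HADOOP_CLIENT_OPTS -> ${HADOOP_CLIENT_OPTS:-(unset)}"
echo "MAHOUT_LOCAL       -> ${MAHOUT_LOCAL:-(unset)}"
```

Running this in the same shell that launches mahout shows whether a
Hadoop-side setting is present alongside the exported MAHOUT_HEAPSIZE.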

Does anyone have any experience with this or suggestions?

Thank you,

Jessie Wright
