mahout-user mailing list archives

From Nicholas Kolegraff <nickkolegr...@gmail.com>
Subject Re: Parallel ALS-WR on very large matrix -- crashing (I think)
Date Thu, 02 Feb 2012 01:23:34 GMT
Thanks for the prompt reply, Kate!

The cluster has since been torn down on EC2, but I did monitor it during
the job execution and everything seemed to be OK.  The JobTracker and NameNode
continued to report status.

I was aware of the configuration setting and was hoping to refrain from playing
with it :-) I'm scared to set it too high, since that extra time could get
unnecessarily charged to my EC2 account. :S
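
In case it helps anyone else, here is a rough sketch of what bumping that
setting might look like; I am assuming the Hadoop 0.20/1.x property name
mapred.task.timeout (its 600000 ms default matches the "600 seconds" in the
error below), and the class name and 30-minute value are only illustrative:

    // Illustrative sketch: raise the task timeout before submitting a job.
    // Assumes the Hadoop 0.20/1.x property name "mapred.task.timeout".
    import org.apache.hadoop.conf.Configuration;

    public class RaiseTaskTimeout {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Default is 600000 ms (600 s); 1800000 ms = 30 minutes.
        conf.setLong("mapred.task.timeout", 1800000L);
        System.out.println("mapred.task.timeout = "
            + conf.get("mapred.task.timeout"));
      }
    }

If the job driver goes through ToolRunner, the same property can usually be
passed on the command line as -Dmapred.task.timeout=1800000 instead.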

Do you know if a task should still report status in the midst of a complex
computation?  It seems odd that it wouldn't at least send a friendly hello.

On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson <ericson@cs.colostate.edu> wrote:

> Hi,
>
> This *may* just be a Hadoop issue - it sounds like the JobTracker is
> upset that it hasn't heard from one of the workers in too long (over
> 600 seconds).
> Can you check your Hadoop Administration pages for the cluster?  Does
> the cluster still seem to be functioning?
> I haven't used Hadoop with EC2, so I'm not sure how difficult it will
> be to check the cluster :-/
> If everything seems to be OK, there's a Hadoop setting to modify how
> long it's willing to wait before assuming a machine has failed and
> killing a task.
>
>
> -Kate
>
> On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff
> <nickkolegraff@gmail.com> wrote:
> > Hello,
> > I am attempting to run parallelALS on a very large matrix on EC2.
> > The matrix is ~8 million x 1 million, very sparse: only .007% of the cells have data.
> > I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge).
> > (I kept getting OutOfMemory exceptions, so I kept upping the ante until I
> > arrived at the above configuration.)
> >
> > It makes it through the following jobs no problem:
> >
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
> >
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
> >
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
> >
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> > ....
> >
> ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> >
> > Then it crashes here with only the following error messages:
> > Task attempt_201201311814_0023_m_000000_0 failed to report status for 600
> > seconds. Killing!
> >
> > Each map attempt in the 23rd 'SolveExplicitFeedback' job fails to report its
> > status?
> >
> > I'm not sure what is causing this -- I am still trying to wrap my head
> > around the Mahout API.
> >
> > Could this still be a memory issue?
> >
> > Hopefully I'm not missing something trivial?!?!
>
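
P.S. A quick back-of-the-envelope check on the matrix size quoted above; the
16 bytes per entry is just an assumption, so treat this as a rough sketch:

    // 8M x 1M matrix at 0.007% density: roughly 5.6e8 non-zero entries,
    // i.e. several GB of raw ratings before any factorization work starts.
    public class SparsityEstimate {
      public static void main(String[] args) {
        double rows = 8e6;
        double cols = 1e6;
        double density = 0.00007;                 // 0.007%
        double nonZeros = rows * cols * density;  // ~5.6e8
        double bytesPerEntry = 16;                // assumed: two ints + one double
        System.out.printf("non-zeros ~ %.2e, raw size ~ %.1f GB%n",
            nonZeros, nonZeros * bytesPerEntry / 1e9);
      }
    }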
