Subject: Re: Parallel ALS-WR on very large matrix -- crashing (I think)
From: Nicholas Kolegraff
To: user@mahout.apache.org
Date: Wed, 1 Feb 2012 17:23:34 -0800

Thanks for the prompt reply Kate!

The cluster has since been torn down on EC2, but I did monitor it during the
job execution and all seemed to be OK. The JobTracker and NameNode would
continue to report status.

I was aware of the configuration setting and was hoping to avoid playing with
it :-) I'm scared to set it too high, since that time could get unnecessarily
charged to my EC2 account. :S

Do you know if a task should still report status in the midst of a complex
computation? It seems questionable that it wouldn't just send a friendly hello?

On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson wrote:
> Hi,
>
> This *may* just be a Hadoop issue - it sounds like the JobTracker is
> upset that it hasn't heard from one of the workers in too long (over
> 600 seconds).
> Can you check your Hadoop administration pages for the cluster? Does
> the cluster still seem to be functioning?
> I haven't used Hadoop with EC2, so I'm not sure how difficult it will
> be to check the cluster :-/
> If everything seems to be OK, there's a Hadoop setting to modify how
> long it's willing to wait before assuming a machine has failed and
> killing a task.
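>
> I don't remember the exact name off the top of my head, but I believe the
> relevant property is the task timeout (mapred.task.timeout on older Hadoop
> releases, mapreduce.task.timeout on newer ones). It's in milliseconds and
> defaults to 600000, i.e. the 10 minutes you're seeing in the error. Assuming
> that's really the property your version uses, something like this in
> mapred-site.xml (or passed as a -D option to the job) should raise it:
>
>   <property>
>     <name>mapred.task.timeout</name>
>     <!-- wait 30 minutes instead of the default 10 before a silent
>          task attempt is assumed dead and killed -->
>     <value>1800000</value>
>   </property>
>
> Just keep in mind that a longer timeout also means a genuinely hung task
> will sit there (and bill EC2 time) for that much longer before Hadoop
> gives up on it.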
>
> -Kate
>
> On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff wrote:
> > Hello,
> > I am attempting to run ParallelALS on a very large matrix on EC2.
> > The matrix is ~8 million x 1 million, very sparse (only .007% has data).
> > I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge).
> > (I kept getting OutOfMemory exceptions, so I kept upping the ante until I
> > arrived at the above configuration.)
> >
> > It makes it through the following jobs no problem:
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> > ....
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> >
> > Then it crashes with only the following error message:
> > Task attempt_201201311814_0023_m_000000_0 failed to report status for 600
> > seconds. Killing!
> >
> > Each map attempt in the 23rd job ('SolveExplicitFeedback') fails to report
> > its status?
> >
> > I'm not sure what is causing this; I am still trying to wrap my head
> > around the Mahout API.
> >
> > Could this still be a memory issue?
> >
> > Hopefully I'm not missing something trivial?!
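P.S. To answer part of my own question above: from what I can tell, a plain
Hadoop task that spends a long time on a single record is expected to call
progress() (or bump a counter) itself; otherwise the TaskTracker assumes it
has hung once the timeout expires, which would explain the "failed to report
status" message even though the JVM is busy rather than dead. I don't know
whether Mahout's solve mapper actually does this internally, but the general
pattern in the new mapreduce API would look roughly like the sketch below
(the class and counter names are made up for illustration):

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical long-running mapper: each record needs an expensive solve,
  // so it tells the framework it is still alive instead of staying silent.
  public class LongSolveMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    @Override
    protected void map(IntWritable key, VectorWritable value, Context ctx)
        throws IOException, InterruptedException {

      // ... the expensive per-row least-squares solve would go here ...

      // Heartbeat: resets the mapred.task.timeout clock on the TaskTracker
      // so this attempt is not killed as unresponsive after 600 seconds.
      ctx.progress();

      // Incrementing a counter has the same keep-alive effect and also
      // shows up in the JobTracker web UI, which helps when monitoring.
      ctx.getCounter("ALS", "rows-solved").increment(1);

      ctx.write(key, value);
    }
  }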