Subject: Re: Parallel ALS-WR on very large matrix -- crashing (I think)
From: Nicholas Kolegraff
To: user@mahout.apache.org
Date: Wed, 1 Feb 2012 17:23:34 -0800

Thanks for the prompt reply Kate!

The cluster has since been torn down on EC2, but I did monitor it during the
job execution and all seemed to be OK. The JobTracker and NameNode would
continue to report status.

I was aware of the configuration setting and was hoping to avoid playing with
it :-) I'm scared to set it too high, since that time could get unnecessarily
charged to my EC2 account. :S

Do you know if a task should still report status in the midst of a complex
computation? It seems questionable that it wouldn't just send a friendly hello?

On Wed, Feb 1, 2012 at 4:58 PM, Kate Ericson wrote:
> Hi,
>
> This *may* just be a Hadoop issue - it sounds like the JobTracker is
> upset that it hasn't heard from one of the workers in too long (over
> 600 seconds).
> Can you check your Hadoop administration pages for the cluster? Does
> the cluster still seem to be functioning?
> I haven't used Hadoop with EC2, so I'm not sure how difficult it will
> be to check the cluster :-/
> If everything seems to be OK, there's a Hadoop setting to modify how
> long it's willing to wait before assuming a machine has failed and
> killing a task.
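>
> I don't remember the exact name off the top of my head, but I believe the
> relevant property is the task timeout (mapred.task.timeout on older Hadoop
> releases, mapreduce.task.timeout on newer ones). It's in milliseconds and
> defaults to 600000, i.e. the 10 minutes you're seeing in the error. Assuming
> that's really the property your version uses, something like this in
> mapred-site.xml (or passed as a -D option to the job) should raise it:
>
>   <property>
>     <name>mapred.task.timeout</name>
>     <!-- wait 30 minutes instead of the default 10 before a silent
>          task attempt is assumed dead and killed -->
>     <value>1800000</value>
>   </property>
>
> Just keep in mind that a longer timeout also means a genuinely hung task
> will sit there (and bill EC2 time) for that much longer before Hadoop
> gives up on it.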
>
> -Kate
>
> On Wed, Feb 1, 2012 at 5:48 PM, Nicholas Kolegraff wrote:
> > Hello,
> > I am attempting to run ParallelALS on a very large matrix on EC2.
> > The matrix is ~8 million x 1 million, very sparse (only .007% has data).
> > I am attempting to run on 8 nodes with 34.2 GB of memory each (m2.2xlarge).
> > (I kept getting OutOfMemory exceptions, so I kept upping the ante until I
> > arrived at the above configuration.)
> >
> > It makes it through the following jobs no problem:
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0001_hadoop_ParallelALSFactorizationJob-ItemRatingVectorsMappe
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0002_hadoop_ParallelALSFactorizationJob-TransposeMapper-Reduce
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0003_hadoop_ParallelALSFactorizationJob-AverageRatingMapper-Re
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0004_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> > ....
> > ip-10-166-55-151.us-west-1.compute.internal_1328033659670_job_201201311814_0023_hadoop_ParallelALSFactorizationJob-SolveExplicitFeedbackM
> >
> > Then it crashes with only the following error message:
> > Task attempt_201201311814_0023_m_000000_0 failed to report status for 600
> > seconds. Killing!
> >
> > Each map attempt in the 23rd job ('SolveExplicitFeedback') fails to report
> > its status?
> >
> > I'm not sure what is causing this; I am still trying to wrap my head
> > around the Mahout API.
> >
> > Could this still be a memory issue?
> >
> > Hopefully I'm not missing something trivial?!
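P.S. To answer part of my own question above: from what I can tell, a plain
Hadoop task that spends a long time on a single record is expected to call
progress() (or bump a counter) itself; otherwise the TaskTracker assumes it
has hung once the timeout expires, which would explain the "failed to report
status" message even though the JVM is busy rather than dead. I don't know
whether Mahout's solve mapper actually does this internally, but the general
pattern in the new mapreduce API would look roughly like the sketch below
(the class and counter names are made up for illustration):

  import java.io.IOException;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.mahout.math.VectorWritable;

  // Hypothetical long-running mapper: each record needs an expensive solve,
  // so it tells the framework it is still alive instead of staying silent.
  public class LongSolveMapper
      extends Mapper<IntWritable, VectorWritable, IntWritable, VectorWritable> {

    @Override
    protected void map(IntWritable key, VectorWritable value, Context ctx)
        throws IOException, InterruptedException {

      // ... the expensive per-row least-squares solve would go here ...

      // Heartbeat: resets the mapred.task.timeout clock on the TaskTracker
      // so this attempt is not killed as unresponsive after 600 seconds.
      ctx.progress();

      // Incrementing a counter has the same keep-alive effect and also
      // shows up in the JobTracker web UI, which helps when monitoring.
      ctx.getCounter("ALS", "rows-solved").increment(1);

      ctx.write(key, value);
    }
  }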