From: Dmitriy Ryaboy <dvryaboy@gmail.com>
To: user@pig.apache.org
Date: Sun, 12 Dec 2010 03:18:13 -0800
Subject: Re: running bigger pig jobs on amazon ec2

Johannes,

I wonder if something is putting enough pressure on the datanodes that they are unable to ack all the write requests fast enough, causing many tasks to give up due to what amounts to TCP throughput collapse. The logs certainly seem to indicate something unhealthy happening at the DFS level.

A bunch of questions below; I am stabbing in the dark here, as I don't run clusters in EC2.

- Do you have any stats on the network traffic in your cluster while this is happening?
- Same question, but for disk/CPU utilization and similar metrics on the datanodes?
- I am curious why there is a loader being instantiated in the reducer. Can you send along the relevant portion of the explain plan? (A sketch of how to dump it is below.)
- How many map tasks and reduce tasks are you running, and how big is the cluster?
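For the explain plan, something like the following from the grunt shell will print it. The aliases and the LOAD statement here are just placeholders; substitute whatever your script actually does:

    grunt> raw = LOAD 'input' AS (line:chararray);   -- placeholder for your actual LOAD
    grunt> grp = GROUP raw ALL;                      -- placeholder for the rest of the script
    grunt> cnt = FOREACH grp GENERATE COUNT(raw);
    grunt> EXPLAIN cnt;                              -- prints the logical, physical, and map-reduce plans

Running EXPLAIN on the alias you eventually STORE shows the complete plans; the map-reduce plan is the relevant portion here.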
Also: is the storefunc you are using doing something like writing multiple files? And when running a cluster in EC2, what are you using for storage: S3, EBS, ...?

D

On Fri, Dec 10, 2010 at 2:53 AM, jr wrote:
> Hello Ashutosh,
>
> I'm running entirely on Amazon EC2, and while I get those errors, I seem
> to be able to access HDFS by using "hadoop fs" :/
>
> Regards,
> Johannes
>
> On Wednesday, 08.12.2010, at 09:11 -0800, Ashutosh Chauhan wrote:
> > From the logs it looks like the issue is not with Pig but with your HDFS.
> > Either your HDFS is running out of space, or some (or all) nodes in
> > your cluster can't talk to each other (network issue?).
> >
> > Ashutosh
> >
> > On Wed, Dec 8, 2010 at 06:09, jr wrote:
> > > Hi guys,
> > > I'm having some trouble finishing jobs that run smoothly on a smaller
> > > dataset but always fail at 99% if I try to run the job on the whole set.
> > > I can see a few killed map and a few killed reduce tasks, but quite a lot
> > > of failed reduce tasks that all show the same exception at the end.
> > > Here is what I have in the logs:
> > >
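(Re: Ashutosh's point above about HDFS running out of space or nodes not being able to reach each other: beyond "hadoop fs", a couple of quick checks from any box with the Hadoop client configured would help narrow it down. These are stock Hadoop commands, nothing specific to your setup:)

    # overall capacity, remaining space, and the list of live/dead datanodes
    hadoop dfsadmin -report

    # block-level health: missing, corrupt, or under-replicated blocks
    hadoop fsck /

If the dfsadmin report shows datanodes dropping out or filling up while the job is stuck at 99%, that would line up with the write errors in the task logs.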