pig-user mailing list archives

From Dmitriy Ryaboy <dvrya...@gmail.com>
Subject Re: running bigger pig jobs on amazon ec2
Date Sun, 12 Dec 2010 11:18:13 GMT
I wonder if something is putting enough pressure on the datanodes that they
are unable to ack all the write requests fast enough, causing many tasks to
give up due to what amounts to TCP throughput collapse.
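
If that's what is happening, the datanode logs usually show "xceiver" or write-pipeline errors, and the stock limits are low. As a sketch only (property names are from the 0.20-era HDFS configs -- the misspelling of "xcievers" is the actual key name -- and the values are illustrative, not tuned for your cluster):

```xml
<!-- hdfs-site.xml, on each datanode -->
<property>
  <!-- Max concurrent xceiver threads per datanode; the default (256)
       is often too low under heavy write load. Note the historical
       misspelling: it is the real key name. -->
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
<property>
  <!-- Write-pipeline socket timeout in ms; illustrative value. -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>600000</value>
</property>
```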

The logs certainly seem to indicate something unhealthy happening at the DFS
level. A bunch of questions below... I am stabbing in the dark here, as I
don't run clusters in EC2.

Do you have any stats on the network traffic in your cluster while this is
happening?
Same, but for disk/cpu utilization and similar metrics on the data nodes?

I am curious why there's a loader being instantiated in the reducer. Can you
send along a relevant portion of the explain plan?
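
For example, from the grunt shell (the alias name here is hypothetical -- use whichever relation you STORE at the end of your script):

```
grunt> final_rel = LOAD 'input' USING PigStorage();
grunt> explain final_rel;
```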

How many map tasks and reduce tasks are you running?

How big is the cluster?

Is the storefunc you are using doing something like writing multiple files?

When running a cluster in EC2, what are you using for storage? S3, EBS...?
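
A rough way to get at the space question is to grep a `hadoop dfsadmin -report` for nearly-full datanodes. This is just a sketch of the parsing (the field layout matches 0.20-era report output), run here against a canned sample instead of a live cluster; normally you would pipe the real command into the awk:

```shell
#!/bin/sh
# Flag datanodes above 90% DFS usage from a dfsadmin-style report.
# With a live cluster this would be:
#   hadoop dfsadmin -report | awk '...'
# A canned sample stands in for the live output here.
cat > /tmp/dfs-report.txt <<'EOF'
Name: 10.0.0.1:50010
DFS Used%: 97.2%
Name: 10.0.0.2:50010
DFS Used%: 41.5%
EOF

full_nodes=$(awk '
  /^Name:/      { node = $2 }
  /^DFS Used%:/ { used = $3; sub(/%/, "", used)
                  if (used + 0 > 90) print node, "is", used "% full" }
' /tmp/dfs-report.txt)
echo "$full_nodes"
```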


On Fri, Dec 10, 2010 at 2:53 AM, jr <johannes.russek@io-consulting.net> wrote:

> Hello Ashutosh,
> I'm running entirely on Amazon EC2, and while I get those errors, I seem
> to be able to access HDFS by using "hadoop fs" :/
> regards,
> Johannes
> On Wednesday, 08.12.2010, 09:11 -0800, Ashutosh Chauhan wrote:
> > From the logs it looks like the issue is not with Pig but with your HDFS.
> > Either your HDFS is running out of space or some (or all) nodes in
> > your cluster can't talk to each other (network issue?)
> >
> > Ashutosh
> > On Wed, Dec 8, 2010 at 06:09, jr <johannes.russek@io-consulting.net>
> wrote:
> > > Hi guys,
> > > I'm having some trouble finishing jobs that run smoothly on a smaller
> > > dataset, but always fail at 99% if I try to run the job on the whole
> > > set.
> > > I can see a few killed map and a few killed reduce tasks, but quite a lot
> > > of failed reduce tasks that all show the same exception at the end.
> > > Here is what I have in the logs:
> > >
