From: Dmitriy Ryaboy <dvryaboy@gmail.com>
To: user@pig.apache.org
Date: Sun, 12 Dec 2010 03:18:13 -0800
Subject: Re: running bigger pig jobs on amazon ec2

Johannes,

I wonder if something is putting enough pressure on the datanodes that they are unable to ack all the write requests fast enough, causing many tasks to give up due to what amounts to TCP throughput collapse. The logs certainly seem to indicate something unhealthy happening at the DFS level.

A bunch of questions below; I am stabbing in the dark here, as I don't run clusters in EC2.

- Do you have any stats on the network traffic in your cluster while this is happening?
- Same question, but for disk/CPU utilization and similar metrics on the datanodes?
- I am curious why there is a loader being instantiated in the reducer. Can you send along the relevant portion of the explain plan? (A sketch of how to dump it is below.)
- How many map tasks and reduce tasks are you running, and how big is the cluster?
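For the explain plan, something like the following from the grunt shell will print it. The aliases and the LOAD statement here are just placeholders; substitute whatever your script actually does:

    grunt> raw = LOAD 'input' AS (line:chararray);   -- placeholder for your actual LOAD
    grunt> grp = GROUP raw ALL;                      -- placeholder for the rest of the script
    grunt> cnt = FOREACH grp GENERATE COUNT(raw);
    grunt> EXPLAIN cnt;                              -- prints the logical, physical, and map-reduce plans

Running EXPLAIN on the alias you eventually STORE shows the complete plans; the map-reduce plan is the relevant portion here.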
Also: is the storefunc you are using doing something like writing multiple files? And when running a cluster in EC2, what are you using for storage: S3, EBS, ...?

D

On Fri, Dec 10, 2010 at 2:53 AM, jr wrote:
> Hello Ashutosh,
>
> I'm running entirely on Amazon EC2, and while I get those errors, I seem
> to be able to access HDFS by using "hadoop fs" :/
>
> Regards,
> Johannes
>
> On Wednesday, 08.12.2010, at 09:11 -0800, Ashutosh Chauhan wrote:
> > From the logs it looks like the issue is not with Pig but with your HDFS.
> > Either your HDFS is running out of space, or some (or all) nodes in
> > your cluster can't talk to each other (network issue?).
> >
> > Ashutosh
> >
> > On Wed, Dec 8, 2010 at 06:09, jr wrote:
> > > Hi guys,
> > > I'm having some trouble finishing jobs that run smoothly on a smaller
> > > dataset but always fail at 99% if I try to run the job on the whole set.
> > > I can see a few killed map and a few killed reduce tasks, but quite a lot
> > > of failed reduce tasks that all show the same exception at the end.
> > > Here is what I have in the logs:
> > >
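(Re: Ashutosh's point above about HDFS running out of space or nodes not being able to reach each other: beyond "hadoop fs", a couple of quick checks from any box with the Hadoop client configured would help narrow it down. These are stock Hadoop commands, nothing specific to your setup:)

    # overall capacity, remaining space, and the list of live/dead datanodes
    hadoop dfsadmin -report

    # block-level health: missing, corrupt, or under-replicated blocks
    hadoop fsck /

If the dfsadmin report shows datanodes dropping out or filling up while the job is stuck at 99%, that would line up with the write errors in the task logs.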