From: Andreas Kostyrka
To: core-user@hadoop.apache.org
Cc: Devaraj Das
Subject: Re: hadoop 0.17.1 reducer not fetching map output problem
Date: Thu, 24 Jul 2008 22:02:49 +0200
Message-Id: <200807242202.53125.andreas@kostyrka.org>

On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> when it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication. From
> >> your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and let
> >> us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce job was running, but
> > that led only to a situation where the jobtracker did not restart the
> > job, showed it as still running, and was not able to kill the reduce task
> > via hadoop job -kill-task nor -fail-task.
>
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed as lost and all
> reducers that were running on that node would be reexecuted).
>
> > I hope to avoid a repeat, I'll be relapsing our cluster to 0.15 today. A
> > peer at another startup confirmed the whole batch of problems I've been
> > experiencing, and for him 0.15 works for production.
> >
> >
> > No question, 0.17 is way better than 0.16, on the other hand I wonder how
> > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
> > introduced reducing to our workloads, and before 0.16 failed >80% of the
> > jobs with reducers not being able to get their output. 0.17.0 improved
> > that to a point where one can, with some pain, e.g. restarting the
> > cluster daily, not storing anything important on HDFS, only temporary
> > data, ..., use it somehow for production, at least for small jobs.) So
> > one wonders how 0.16 got released? Or was it meant only as a
> > developer-only bug fixing series?
> >
>
> Pls raise jiras for the specific problems.

I know, that's why I bracketed it as rantmode. OTOH, many of these issues
either had this creepy feeling where you wondered if you did something wrong,
or were issues where I had to react relatively quickly, which usually destroys
the faulty state. (I know, as a developer having reproduced a bug is golden.
As an admin asked about processing lag, it's rather the opposite.)

Plus, fixing the issue in the next release or even via a patch means that I
have a non-working cluster till then. Now that means I would need to start
debugging the cluster utility software instead of our apps. ;(

Andreas