Mailing-List: contact core-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
From: Andreas Kostyrka <andreas@kostyrka.org>
To: core-user@hadoop.apache.org
Subject: Re: hadoop 0.17.1 reducer not fetching map output problem
Date: Thu, 24 Jul 2008 22:02:49 +0200
User-Agent: KMail/1.9.9
Cc: Devaraj Das <ddas@yahoo-inc.com>
References: <C4AED87E.479BE%ddas@yahoo-inc.com>
In-Reply-To: <C4AED87E.479BE%ddas@yahoo-inc.com>
MIME-Version: 1.0
Content-Type: multipart/signed;
  boundary="nextPart2211351.HSbgm6ggPF";
  protocol="application/pgp-signature";
  micalg=pgp-sha1
Content-Transfer-Encoding: 7bit
Message-Id: <200807242202.53125.andreas@kostyrka.org>

--nextPart2211351.HSbgm6ggPF
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> when it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication. From
> >> your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and let
> >> us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce job was running, but
> > that led only to a situation where the jobtracker did not restart the
> > job, showed it as still running, and was not able to kill the reduce task
> > via hadoop job -kill-task nor -fail-task.
>
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed as lost and all
> reducers that were running on that node would be reexecuted).
>
> > I hope to avoid a repeat; I'll be reverting our cluster to 0.15 today. A
> > peer at another startup confirmed the whole batch of problems I've been
> > experiencing, and for him 0.15 works for production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16, on the other hand I wonder how
> > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
> > introduced reducing to our workloads, and before 0.16 failed >80% of the
> > jobs with reducers not being able to get their output. 0.17.0 improved
> > that to a point where one can, with some pain, e.g. restarting the
> > cluster daily, not storing anything important on HDFS, only temporary
> > data, ..., use it somehow for production, at least for small jobs.) So
> > one wonders how 0.16 got released? Or was it meant only as developer-only
> > bug fixing series?
> > </rant-mode>
>
> Pls raise jiras for the specific problems.

I know, that's why I bracketed it as rant-mode. OTOH, many of these issues
either came with this creepy feeling where you wonder whether you did
something wrong, or were issues where I had to react relatively quickly,
which usually destroys the faulty state. (I know, for a developer a
reproduced bug is golden. For an admin being asked about processing lag,
it's rather the opposite.)

Plus, fixing the issue in the next release or even via a patch means that I
have a non-working cluster until then. Now that means I would need to start
debugging the cluster utility software instead of our apps. ;(

Andreas

--nextPart2211351.HSbgm6ggPF--