hadoop-common-user mailing list archives

From Andreas Kostyrka <andr...@kostyrka.org>
Subject Re: hadoop 0.17.1 reducer not fetching map output problem
Date Thu, 24 Jul 2008 20:02:49 GMT
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time
> >> when it happens? I just want to isolate the problem - whether it is a
> >> problem in the TT-JT communication or in the Task-TT communication. From
> >> your description it looks like the problem is between the JT-TT
> >> communication. But pls run the experiment when it happens again and let
> >> us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce job was running, but
> > that led only to a situation where the jobtracker did not restart the
> > job, showed it as still running, and was not able to kill the reduce task
> > via hadoop job -kill-task nor -fail-task.
> The reduce task would eventually be reexecuted (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed as lost and all
> reducers that were running on that node would be reexecuted).
> > I hope to avoid a repeat; I'll be rolling our cluster back to 0.15 today. A
> > peer at another startup confirmed the whole batch of problems I've been
> > experiencing, and for him 0.15 works for production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16, on the other hand I wonder how
> > 0.16 could get released? (I'm using streaming.jar, and with 0.16.x I've
> > introduced reducing to our workloads, and before 0.16 failed >80% of the
> > jobs with reducers not being able to get their output. 0.17.0 improved
> > that to a point where one can, with some pain, e.g. restarting the
> > cluster daily, not storing anything important on HDFS, only temporary
> > data, ..., use it somehow for production, at least for small jobs.) So
> > one wonders how 0.16 got released? Or was it meant only as developer-only
> > bug fixing series?
> > </rant-mode>
> Pls raise jiras for the specific problems.

I know, that's why I bracketed it as rant-mode. OTOH, many of these issues
either gave me this creepy feeling where I wondered if I had done something
wrong, or were issues where I had to react relatively quickly, which usually
destroys the faulty state. (I know, for a developer a reproduced bug is golden.
For an admin being asked about processing lag, it's rather the opposite.)

Plus, fixing the issue in the next release or even via a patch means that I
have a non-working cluster till then. That means I would need to start
debugging the cluster utility software instead of our apps. ;(

