On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
> On 7/25/08 12:09 AM, "Andreas Kostyrka" <andreas@kostyrka.org> wrote:
> > On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
> >> Could you try to kill the tasktracker hosting the task the next time it
> >> happens? I just want to isolate the problem - whether it is a problem in
> >> the TT-JT communication or in the Task-TT communication. From your
> >> description it looks like the problem is in the JT-TT communication. But
> >> pls run the experiment when it happens again and let us know what happens.
> >
> > Well, I did restart the tasktracker where the reduce task was running, but
> > that only led to a situation where the jobtracker did not restart the job,
> > showed it as still running, and I was not able to kill the reduce task with
> > either hadoop job -kill-task or -fail-task.
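
(For completeness, the invocations looked roughly like the two lines below; the
attempt id is only illustrative, not the real one from our cluster:

  hadoop job -kill-task task_200807241530_0001_r_000000_0   # killed attempts are not counted against retries
  hadoop job -fail-task task_200807241530_0001_r_000000_0   # failed attempts are counted against retries

Neither had any visible effect on the stuck reducer.)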
>
> The reduce task would eventually be re-executed (after some timeout,
> defaulting to 10 minutes, the tasktracker would be assumed lost and all
> reducers that were running on that node would be re-executed).
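
(Good to know. If I read mapred-default.xml correctly, that window is
mapred.tasktracker.expiry.interval, so it could presumably be tightened in
hadoop-site.xml with something like the snippet below - untested on my side, and
the value shown is just the 10-minute default restated in milliseconds:

  <property>
    <name>mapred.tasktracker.expiry.interval</name>
    <!-- ms of heartbeat silence after which a tasktracker is declared lost -->
    <value>600000</value>
  </property>

Just noting it here for reference.)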
>
> > In the hope of avoiding a repeat, I'll be rolling our cluster back to 0.15
> > today. A peer at another startup confirmed the whole batch of problems I've
> > been experiencing, and for him 0.15 works in production.
> >
> > <rant-mode>
> > No question, 0.17 is way better than 0.16, but on the other hand I wonder
> > how 0.16 could get released at all. (I'm using streaming.jar, and it was
> > with 0.16.x that I introduced reducing into our workloads; before the
> > upgrade, 0.16 failed more than 80% of the jobs with reducers because they
> > could not fetch their output. 0.17.0 improved that to the point where one
> > can, with some pain - restarting the cluster daily, storing only temporary
> > data on HDFS, nothing important, ... - use it somehow in production, at
> > least for small jobs.) So how did 0.16 get released? Or was it meant only
> > as a developer-only bug-fixing series?
> > </rant-mode>
>
> Pls raise jiras for the specific problems.
I know, that's why I bracketed it in rant-mode. OTOH, many of these issues
either came with this creepy feeling that I might have done something wrong, or
were issues where I had to react relatively quickly, which usually destroys the
faulty state. (I know that for a developer a reproduced bug is golden. For an
admin being asked about processing lag, it's rather the opposite.)
Plus, fixing the issue in the next release or even via a patch means that I
have a non-working cluster until then, and that I would need to start debugging
the cluster utility software instead of our apps. ;(
Andreas