hadoop-common-dev mailing list archives

From "Bryan A. Pendleton" ...@geekdom.net>
Subject Re: Some bugs I'm trying to characterize....
Date Wed, 08 Feb 2006 19:00:04 GMT
(Moved reply to hadoop-dev)

Thanks for your followup. I'm moving to the latest nightly hadoop, and I'll
see how things fare next time I run into these kinds of problems. More
comments below:

On 07 Feb 2006 10:16:34 -0800, Michael Cafarella <
michael_cafarella@comcast.net> wrote:
>   Hi Bryan,
> On Thu, 2006-02-02 at 12:06, Bryan A. Pendleton wrote:
> >
> > 1) If you fill up the space of a datanode, it appears to fail with the
> > wrong exception and reload. This, combined with the currently simple
> > block-allocation method (random), means that one "full" node can cause a
> > big dropoff in NDFS write performance, as clients end up timing out some
> > percent of the time when asked to talk to the "full" node, while the full
> > node is busy reloading.
> >
>   The existing code chooses a node probabilistically, with more weight
> given to machines with more space available.  I should probably change
> this so that machines under a certain limit are just never chosen.  I
> filed bug HADOOP-26 for this.

How does this work/where is this code? I've been trying to trace how it
works, and, so far, I haven't found this "weighting" code. Is this what the
last "else" in chooseTarget(TreeSet forbidden1, TreeSet forbidden2) is
doing? I admit to being a bit confused as to how that code flows... In any
case, I think the default code uses the previous clause, as
dfs.availability.allocation seems to default to "false". Should I be setting
dfs.availability.allocation to true?
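
For reference, the kind of selection described above (more weight to machines with more free space, plus the "never choose below a limit" cutoff proposed in HADOOP-26) could be sketched roughly as follows. This is only an illustration under assumptions: WeightedNodeChooser, choose, and minFreeBytes are hypothetical names, and this is not the actual chooseTarget code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical helper for illustration; not taken from the Hadoop/NDFS source.
class WeightedNodeChooser {
    // Picks a node name with probability proportional to its free bytes.
    // Nodes with less than minFreeBytes free (an assumed cutoff) are never
    // chosen. Returns null when no node qualifies.
    static String choose(Map<String, Long> freeBytesByNode, long minFreeBytes, Random rng) {
        List<String> names = new ArrayList<>();
        List<Long> weights = new ArrayList<>();
        long total = 0;
        for (Map.Entry<String, Long> e : freeBytesByNode.entrySet()) {
            long free = e.getValue();
            if (free > 0 && free >= minFreeBytes) {
                names.add(e.getKey());
                weights.add(free);
                total += free;
            }
        }
        if (names.isEmpty()) {
            return null;
        }
        // Draw a point in [0, total) and walk the cumulative weights.
        long point = (long) (rng.nextDouble() * total);
        for (int i = 0; i < names.size(); i++) {
            point -= weights.get(i);
            if (point < 0) {
                return names.get(i);
            }
        }
        return names.get(names.size() - 1); // guard against floating-point rounding
    }
}
```

With a cutoff like this, a "full" node drops out of the candidate set entirely instead of just becoming unlikely.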

>   Are you running on nodes with varying amounts of disk space on each?
> I've been doing testing on relatively clean sets of machines, so
> I'm probably hitting this problem less often than I "should" be.

Yes. There are several different size drives, and some of them have other
data stored on them besides hadoop/nutch work. My cluster was built up over
a period of time, and has had several extra drives added to several of the
machines. All in all, there's a pretty wide range of hadoop-usable space
available per device... anywhere from 15gb to 200gb.

>   I'm surprised by this; a reduce task that fails should be rescheduled
> elsewhere.  If a map is lost, it too should be rescheduled.  We made a
> few major changes to MapReduce in the last week.  When did you last do
> a bringover from svn?
>   But I agree that a node without space should no longer offer task
> service to the system.  I filed HADOOP-27 for this one.

I think it must've been a badly-handled corner case, or something, as I
haven't seen that problem happen elsewhere since. It might have been some
other error that exhibited itself this way.

>   It's not easy to predict the size of data emitted by either map or
> reduce.  The user can always emit insane strings of arbitrary length.
>   I think the right solution is to allow the administrator to set a
> "minimum size" parameter.  Indeed, I believe this param is still in
> place from earlier DFS work.  While this might be too conservative in
> many cases, an almost-full disk should be a relatively rare edge case
> that shouldn't be optimized for.

That would be helpful. However, I don't think your statement is quite right
- there's no way to tell how large a map's output is going to be, but
shouldn't the space needed for intermediary processing of reduce be totally
predictable? Isn't it the sum of the partition output being combined in a
given reduce, doubled, since there's a "sort" after the copied-together

All the same, you can probably do better with map output by distributing
output files proportional to the space available on the temp drives, rather
than in a merely round-robin fashion. I'd submit a patch for this, but I
don't see any easy way to find free disk space from a MapTask.
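
A rough sketch of the space-proportional idea, under assumptions: TempDirChooser, pickIndex, and chooseDir are hypothetical names, not MapTask code. It relies on java.io.File.getUsableSpace(), which was only added in Java 6, so it would not run on a JVM older than that.

```java
import java.io.File;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of Hadoop's MapTask.
class TempDirChooser {
    // Returns an index chosen in proportion to freeBytes, given a uniform
    // draw u in [0, 1). Entries with no free space are skipped.
    // Returns -1 when nothing has free space.
    static int pickIndex(long[] freeBytes, double u) {
        double total = 0;
        for (long f : freeBytes) {
            total += Math.max(f, 0);
        }
        if (total <= 0) {
            return -1;
        }
        double point = u * total;
        int last = -1;
        for (int i = 0; i < freeBytes.length; i++) {
            if (freeBytes[i] <= 0) {
                continue;
            }
            last = i;
            point -= freeBytes[i];
            if (point < 0) {
                return i;
            }
        }
        return last; // guard against floating-point rounding
    }

    // Queries each temp directory's usable space, then picks one directory
    // in proportion to it. Returns null if no directory has space.
    static File chooseDir(List<File> dirs, Random rng) {
        long[] free = new long[dirs.size()];
        for (int i = 0; i < free.length; i++) {
            free[i] = dirs.get(i).getUsableSpace();
        }
        int idx = pickIndex(free, rng.nextDouble());
        return idx < 0 ? null : dirs.get(idx);
    }
}
```

With drives of 15gb and 200gb free, the smaller drive would receive roughly 7% of the output files rather than the half it gets under round-robin.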

>   This "speculative execution" is exactly what happens in the latest
> version of MapReduce.

Excellent. I'll get myself on the latest code immediately, then.

Bryan A. Pendleton
Ph: (877) geek-1-bp