hadoop-common-user mailing list archives

From Allen Wittenauer <...@yahoo-inc.com>
Subject Re: Disk configuration.
Date Mon, 20 Jul 2009 19:48:53 GMT
On 7/13/09 11:50 AM, "Dmitry Pushkarev" <umka@stanford.edu> wrote:
> What I want is that 2TB of drive space per node were equally accessible to
> both tasktracker and datanode, and I'm not sure that mounting two drives to
> separate folders achieves that. (For example, if a reducer fills one drive,
> will it start writing the rest of the data to the second drive?)
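
For context, the shared layout being asked about looks roughly like this: both daemons list the same mount points and round-robin across them, so they end up competing for the same free space. A minimal sketch; the mount points are hypothetical and the property names are the classic pre-0.21 ones:

```xml
<!-- hadoop-site.xml: shared layout. DataNode and TaskTracker
     draw from the same two disks (hypothetical mount points). -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/hdfs/data,/mnt/disk2/hdfs/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/disk1/mapred/local,/mnt/disk2/mapred/local</value>
</property>
```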

In practice, we're starting to think that sharing the same file system space
doesn't really work.

What we sometimes see during large disk spills is that the tasks take
significantly more disk space than anticipated.  Since the data node now has
less space to work with (if any), the name node won't schedule writes to
those nodes.  The name node then has to hunt elsewhere for free space, and
it can eventually go into a death spiral, searching for targets that do not
violate the block placement rules.

We are thinking more and more that putting MR and HDFS on separate file
systems is the way to go when dealing with apps that are unpredictable.
This also helps significantly with capacity planning, as the ops team knows
-exactly- how much space is for HDFS and how much is for MR.  If you use a
file system like ZFS, you can even adjust on the fly if your estimates are
off.
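A sketch of the separated layout (mount points hypothetical): give each daemon its own file system, so an MR spill can never shrink the space the name node believes HDFS has:

```xml
<!-- hadoop-site.xml: separated layout. Each daemon gets its
     own file system, so MR spills cannot eat into HDFS space. -->
<property>
  <name>dfs.data.dir</name>
  <value>/hdfs/disk1/data,/hdfs/disk2/data</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/mapred/disk1/local,/mapred/disk2/local</value>
</property>
```

With ZFS, both areas can be datasets in a single pool with quotas that are adjustable at runtime, e.g. `zfs set quota=1.5T tank/hdfs` (dataset name hypothetical).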