hadoop-common-user mailing list archives

From Allen Wittenauer ...@yahoo-inc.com>
Subject Re: New bee quick questions :-)
Date Mon, 21 Apr 2008 15:07:58 GMT

On 4/21/08 3:36 AM, "vikas" <pvssvikas@gmail.com> wrote:

    Most of your questions have been answered by Luca, from what I can see,
so let me tackle the rest a bit...

> 4) Let us suppose I want to shut down one datanode for maintenance purposes.
> is there any way to inform Hadoop saying that this particular datanode is
> going down -- please make sure the data in it is replicated elsewhere ?

    You want to do datanode decommissioning.  See
http://wiki.apache.org/hadoop/FAQ#17 for details.
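
    As a rough sketch of the bookkeeping involved, and assuming the usual
procedure (list the host in the file named by the dfs.hosts.exclude
property, then ask the name node to re-read it), something like the
following could maintain the exclude file.  The helper name, the file path
and the host name are made up for illustration; check the FAQ entry for the
authoritative steps.

```python
import os

def add_to_exclude_file(exclude_path, hostname):
    """Append hostname to the exclude file unless it is already listed.

    Returns the full list of excluded hosts after the update.
    exclude_path is assumed to be the file that dfs.hosts.exclude points at.
    """
    hosts = []
    if os.path.exists(exclude_path):
        with open(exclude_path) as f:
            hosts = [line.strip() for line in f if line.strip()]
    if hostname not in hosts:
        hosts.append(hostname)
        with open(exclude_path, "w") as f:
            f.write("\n".join(hosts) + "\n")
    return hosts

# After updating the file, the name node has to be told to re-read it.
# This sketch only records the command; it does not run it.
REFRESH_COMMAND = ["bin/hadoop", "dfsadmin", "-refreshNodes"]
```

    The node should only be shut down once the name node reports that
decommissioning has finished, i.e. once its blocks have been re-replicated
elsewhere.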

> 5) I was going through some videos on MAP-Reduce and few Yahoo tech talks.
> in that they were specifying a Hadoop cluster has multiple cores -- what
> does this mean ?

    I haven't watched the tech talks in ages, but we generally refer to
cores in a variety of ways.  There is the single physical box version--an
individual processor has more than one execution unit, thereby giving it a
degree of parallelism.  Then there is the complete grid count--an individual
grid can have lots and lots of processors with lots and lots of individual
cores on those processors, which works out to be a pretty good rough
estimate of how many individual Hadoop tasks can be run simultaneously.
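
    To put numbers on that rule of thumb, here is a back-of-the-envelope
sketch.  The grid dimensions are invented, and "one task per core" is only
the rough approximation described above, not a Hadoop setting.

```python
def estimate_task_capacity(nodes, processors_per_node, cores_per_processor):
    """Rough estimate of simultaneous tasks, assuming one task per core."""
    return nodes * processors_per_node * cores_per_processor

# Hypothetical grid: 100 boxes, 2 processors per box, 4 cores per processor.
print(estimate_task_capacity(100, 2, 4))  # prints 800
```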

>   5.1) can I have multiple instance of namenodes running in a cluster apart
> from secondary nodes ?

    No.  The name node is a single point of failure in the system.

> 6) If I go on create huge files will they be balanced among all the
> datanodes ? or do I need to change the creation of file location in the
> application.

    In addition to what Luca said, be aware that if you load a file from a
machine that is running a data node process, the data for that file will
*always* get loaded to that machine.  This can cause your data nodes to get
extremely unbalanced.  You are much better off doing data loads *off grid*,
that is, from another machine.  Since you only need the hadoop configuration
and binaries available (in other words, no hadoop processes need be running),
this usually isn't too painful to do.
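
    The effect is easy to see in a toy simulation.  This is a deliberately
simplified model, not HDFS code: it assumes a replication factor of 3,
ignores racks and disk space, and assumes that a writer running on a data
node always keeps one replica of each block locally while the remaining
replicas go to random other nodes.  All names below are hypothetical.

```python
import random

def place_blocks(num_blocks, num_nodes, writer_node=None, replication=3, seed=0):
    """Return the number of block replicas that end up on each node.

    writer_node=None models an off-grid load: every replica goes to a
    randomly chosen node.  Otherwise one replica of each block is pinned
    to writer_node and the rest go to random other nodes.
    """
    rng = random.Random(seed)
    counts = [0] * num_nodes
    for _ in range(num_blocks):
        if writer_node is None:
            targets = rng.sample(range(num_nodes), replication)
        else:
            others = [n for n in range(num_nodes) if n != writer_node]
            targets = [writer_node] + rng.sample(others, replication - 1)
        for node in targets:
            counts[node] += 1
    return counts

on_grid = place_blocks(1000, 10, writer_node=0)
off_grid = place_blocks(1000, 10)
print("loaded from node 0:", on_grid)
print("loaded off grid:   ", off_grid)
```

    In the first run node 0 holds a replica of every one of the 1000 blocks,
while in the second the replicas are spread roughly evenly across the ten
nodes.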

    In 0.16.x, there is a rebalancer to help fix this situation, but I have
no practical experience with it yet to say whether or not it works.
