hadoop-common-user mailing list archives

From "Jeff Eastman" <jeast...@collab.net>
Subject RE: Starting up a larger cluster
Date Fri, 08 Feb 2008 17:32:41 GMT
I noticed that phenomenon right off the bat. Is that a designed "feature"
or just an unhappy consequence of how blocks are allocated? Ted
compensates for this by aggressively rebalancing his cluster, often by
adjusting the replication up and down, but I wonder whether a better
allocation strategy would fix it at the source.

I've also used Ted's trick, with less than marvelous results. I'd hate
to pull my biggest machine (where I store all the backup files) out of
the cluster just to get more even block distribution, but I may have to.

Jeff

-----Original Message-----
From: Allen Wittenauer [mailto:aw@yahoo-inc.com] 
Sent: Friday, February 08, 2008 9:15 AM
To: core-user@hadoop.apache.org
Subject: Re: Starting up a larger cluster

On 2/7/08 11:01 PM, "Tim Wintle" <tim.wintle@teamrubber.com> wrote:

> it's useful to be able to connect from nodes that aren't in the slaves
> file so that you can put in input data direct from another machine
> that's not part of the cluster,

    I'd actually recommend this as a best practice.  We've been bitten
over... and over... and over... by users loading data into HDFS from a
data node, only to discover that the block distribution is pretty
horrid... which in turn means that MR performance is equally horrid.
[Remember: all writes will go to the local node if it is a data node!]
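
The skew described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not HDFS's actual placement code: it ignores racks and free space, assumes a replication factor of 3, and assumes that the first replica of every block lands on the writing node when that node is a data node, with the remaining replicas on other nodes picked at random. The function name `place_blocks` and the node names are hypothetical.

```python
import random
from collections import Counter


def place_blocks(num_blocks, datanodes, writer=None, replication=3, seed=0):
    """Count replicas per data node under a simplified placement model.

    Assumption: if the writer is itself a data node, the first replica of
    every block goes to it; otherwise the first replica goes to a random
    data node. Remaining replicas go to distinct random other nodes.
    Rack awareness and disk usage are deliberately ignored.
    """
    rng = random.Random(seed)
    counts = Counter({node: 0 for node in datanodes})
    for _ in range(num_blocks):
        if writer in datanodes:
            first = writer
        else:
            first = rng.choice(datanodes)
        others = [n for n in datanodes if n != first]
        replicas = [first] + rng.sample(others, replication - 1)
        for node in replicas:
            counts[node] += 1
    return counts


nodes = [f"dn{i:02d}" for i in range(10)]

# Loading 1000 blocks from a machine that is also a data node.
local = place_blocks(1000, nodes, writer="dn00")

# Loading the same 1000 blocks from a machine outside the cluster.
remote = place_blocks(1000, nodes, writer=None)

print("loaded from dn00:   ", sorted(local.items()))
print("loaded from outside:", sorted(remote.items()))
```

Under this model the loading node ends up holding a replica of every block it writes, while loading from outside the cluster spreads replicas roughly evenly across the data nodes.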

    We're now down to the point that we've got one (relatively smaller)
grid that is used for data loading/creation/extraction which then
distcp's its contents to another grid.

    Less than ideal, but definitely helps the performance of the entire
'real' grid.


