hadoop-common-user mailing list archives

From Allen Wittenauer ...@yahoo-inc.com>
Subject Re: Starting up a larger cluster
Date Fri, 08 Feb 2008 17:15:13 GMT
On 2/7/08 11:01 PM, "Tim Wintle" <tim.wintle@teamrubber.com> wrote:

>  it's
> useful to be able to connect from nodes that aren't in the slaves file
> so that you can put in input data direct from another machine that's not
> part of the cluster,

    I'd actually recommend this as a best practice.  We've been bitten over...
and over... and over... by users loading data into HDFS from a data node,
only to discover that the block distribution is pretty horrid... which in
turn means that MR performance is equally horrid. [Remember: all writes will
go to the local node if it is a data node!]
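
    To make the skew concrete, here is a rough sketch in Python. It is a toy model, not HDFS's actual placement code: it assumes the first replica of every block lands on the writing client when that client is a data node, and that the remaining replicas go to random other nodes. Rack awareness is ignored, and the node names, block count, and replication factor of 3 are made up for illustration.

```python
import random
from collections import Counter


def place_blocks(num_blocks, nodes, client=None, replication=3, seed=0):
    """Count replicas per node under a simplified (assumed) placement model."""
    rng = random.Random(seed)
    counts = Counter({n: 0 for n in nodes})
    for _ in range(num_blocks):
        # Assumption: a client that is itself a data node keeps the
        # first replica of every block it writes.
        if client in nodes:
            first = client
        else:
            first = rng.choice(nodes)
        # Remaining replicas go to distinct random nodes (no rack logic here).
        others = rng.sample([n for n in nodes if n != first], replication - 1)
        for n in [first, *others]:
            counts[n] += 1
    return counts


nodes = [f"dn{i:02d}" for i in range(20)]          # hypothetical 20-node grid
from_datanode = place_blocks(1000, nodes, client="dn00")
from_outside = place_blocks(1000, nodes, client=None)

print("loaded from dn00:    dn00 holds", from_datanode["dn00"], "replicas")
print("loaded from outside: dn00 holds", from_outside["dn00"], "replicas")
```

    In this model the loading node ends up with a replica of every single block, while a load from outside the cluster spreads replicas roughly evenly across the grid.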

    We're now down to the point that we've got one (relatively smaller) grid
that is used for data loading/creation/extraction which then distcp's its
contents to another grid.

    Less than ideal, but definitely helps the performance of the entire
'real' grid.
