Date: Mon, 24 Mar 2008 01:53:38 +0530 (IST)
From: Amar Kamat
To: core-user@hadoop.apache.org
Subject: Re: One Simple Question About Hadoop DFS
On Sun, 23 Mar 2008, Chaman Singh Verma wrote:
> Hello,
>
> I am exploring Hadoop and MapReduce, and I have one very simple question.
>
> I have a 500 GB dataset on my local disk, and I have written both the map
> and reduce functions. Now how should I start?
>
> 1. I copy the data from the local disk to DFS. I have configured DFS with
> 100 machines. I hope that it will split the file across the 100 nodes
> (with some replication).
>
Yes. You need to copy the data from your local disk to the DFS. It will
split the file into blocks based on the DFS block size (dfs.block.size).
The default block size is 64 MB, so a 500 GB file would yield 8000 blocks.

> 2. For MapReduce, should I specify 100 nodes with SetMaxMapTask()? If I
> specify fewer than 100, will the blocks migrate? If the blocks don't
> migrate, then why is this function provided to users? Why is the number
> of tasks not taken from the startup script?
>
Here again, the maximum number of maps is bounded by the DFS block size.
Hence, in the default case you would have 8000 maps (unless you use your
own input format).

> 3. If I specify more than 100, will load balancing be done automatically,
> or does the user have to specify that as well?
>
In short, it is the DFS block size, together with the input format, that
controls the number of maps. The number of maps passed to the framework is
only used as a hint; sometimes it doesn't matter what value is passed.
Amar

> Perhaps these are very simple questions, but I think MapReduce simplifies
> many things (compared to MPI-based programming), so beginners like me can
> have a difficult time understanding the model.
>
> csv
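[Editor's note] The block arithmetic in the thread can be checked with a quick sketch. The sizes are the figures discussed above (500 GB file, 64 MB default dfs.block.size); the `hadoop dfs` path in the comment is hypothetical and only illustrates the copy step from answer 1.

```shell
#!/bin/sh
# Back-of-the-envelope: how many DFS blocks (and hence default map tasks)
# a 500 GB file yields with the default 64 MB dfs.block.size.
FILE_SIZE_MB=$((500 * 1024))    # 500 GB expressed in MB
BLOCK_SIZE_MB=64                # default dfs.block.size, in MB
NUM_BLOCKS=$((FILE_SIZE_MB / BLOCK_SIZE_MB))
echo "$NUM_BLOCKS"              # 8000 blocks -> ~8000 map tasks by default

# The actual copy into DFS (answer 1) would be something like the
# following, run against a configured cluster (paths are hypothetical):
#   bin/hadoop dfs -copyFromLocal /local/dataset /user/csv/dataset
```

Changing dfs.block.size before the copy changes the block count, and with it the default number of maps.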