Date: Mon, 24 Mar 2008 01:53:38 +0530 (IST)
From: Amar Kamat
To: core-user@hadoop.apache.org
Subject: Re: One Simple Question About Hadoop DFS
On Sun, 23 Mar 2008, Chaman Singh Verma wrote:
> Hello,
>
> I am exploring Hadoop and MapReduce, and I have one very simple question.
>
> I have a 500 GB dataset on my local disk, and I have written both the map
> and reduce functions. Now how should I start?
>
> 1. I copy the data from the local disk to DFS. I have configured DFS with
> 100 machines. I hope that it will split the file across the 100 nodes
> (with some replication).
>
Yes. You need to copy the data from your local disk to the DFS. It will
split the file into blocks based on the DFS block size (dfs.block.size).
The default block size is 64 MB, so a 500 GB file would yield 8000 blocks.

> 2. For MapReduce, should I specify 100 nodes with SetMaxMapTask()? If I
> specify fewer than 100, will the blocks migrate? If the blocks don't
> migrate, then why is this function provided to users? Why is the number
> of tasks not taken from the startup script?
>
Here again, the maximum number of maps is bounded by the DFS block size.
Hence, in the default case you would have 8000 maps (unless you use your
own input format).

> 3. If I specify more than 100, will load balancing be done automatically,
> or does the user have to specify that as well?
>
In short, it is the DFS block size, together with the input format, that
controls the number of maps. The number of maps passed to the framework is
only used as a hint; sometimes it doesn't matter what value is passed.
Amar

> Perhaps these are very simple questions, but I think MapReduce simplifies
> many things (compared to MPI-based programming), so beginners like me can
> have a difficult time understanding the model.
>
> csv
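[Editor's note] The block arithmetic in the thread can be checked with a quick sketch. The sizes are the figures discussed above (500 GB file, 64 MB default dfs.block.size); the `hadoop dfs` path in the comment is hypothetical and only illustrates the copy step from answer 1.

```shell
#!/bin/sh
# Back-of-the-envelope: how many DFS blocks (and hence default map tasks)
# a 500 GB file yields with the default 64 MB dfs.block.size.
FILE_SIZE_MB=$((500 * 1024))    # 500 GB expressed in MB
BLOCK_SIZE_MB=64                # default dfs.block.size, in MB
NUM_BLOCKS=$((FILE_SIZE_MB / BLOCK_SIZE_MB))
echo "$NUM_BLOCKS"              # 8000 blocks -> ~8000 map tasks by default

# The actual copy into DFS (answer 1) would be something like the
# following, run against a configured cluster (paths are hypothetical):
#   bin/hadoop dfs -copyFromLocal /local/dataset /user/csv/dataset
```

Changing dfs.block.size before the copy changes the block count, and with it the default number of maps.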