Date: Tue, 11 Sep 2012 15:07:45 +0900
From: Elaine Gan <elaine-gan@gmo.jp>
To: user@hadoop.apache.org
Subject: Re: Understanding of the hadoop distribution system (tuning)
In-Reply-To: <20120911105610.2C8B.13FE4A9A@gmo.jp>
Message-Id: <20120911150744.2C90.13FE4A9A@gmo.jp>

Hi Hermanth,

Thank you for your detailed answers.
Your answers helped me a lot in understanding, especially on the Job UI.

Sorry, I missed out my specs:
NameNode (JobTracker)  : CPU x 4
DataNode (TaskTracker) : CPU x 4

I am replying inline too.

> > I have data of around 518 MB, and I wrote an MR program to process it.
> > Here are some of my settings in my mapred-site.xml:
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
>
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more.)
>
> Also, it would help if you can tell us your cluster size - i.e. the
> number of slaves.

Cluster size (number of slaves) = 4

Yes, I meant that the maximum number of tasks that can run on ONE machine
is 20, for both map and reduce.

> > My block size is the default, 64 MB.
> > With my data size = 518 MB, I guess setting the maximum for MR tasks
> > to 20 is far more than enough (518 / 64 = 8) - did I get that right?
>
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of
> maps of your job (assuming there's a single job running). If the number
> of slots is less than the number of maps, the maps will be scheduled in
> multiple waves. On your jobtracker main page, Cluster Summary > Map Task
> Capacity gives you the total slots available in your cluster.

My Map Task Capacity = 80

So, from the explanation and from my data size and configuration:

Data size = 518 MB
Number of map tasks required = 518 / 64 = 8 tasks

These 8 tasks should be spread among 4 slaves, which means each node
should be able to handle at least 2 tasks.
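As a sanity check on the arithmetic above, here is a small sketch (using the figures from this thread; note that 518 / 64 leaves a 6 MB remainder, so the framework would actually create 9 splits rather than 8, with the last split smaller than a block):

```python
import math

# Figures taken from this thread (assumptions, not measured values).
data_size_mb = 518        # input data size
block_size_mb = 64        # dfs.block.size (default 64 MB)
slaves = 4                # number of tasktracker nodes
map_slots_per_node = 20   # mapred.tasktracker.map.tasks.maximum

# One map task per input split; the 6 MB remainder becomes its own split.
num_map_tasks = math.ceil(data_size_mb / block_size_mb)

# Total map slots across the cluster (what the Job UI calls Map Task Capacity).
total_map_slots = slaves * map_slots_per_node

# If tasks exceed slots, maps run in multiple waves; here one wave suffices.
waves = math.ceil(num_map_tasks / total_map_slots)

print(num_map_tasks, total_map_slots, waves)  # -> 9 80 1
```

With 80 slots against 9 map tasks, every map can indeed run in the first wave, so the slot maximum of 20 per node is not the bottleneck here.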
And my setting was mapred.tasktracker.map.tasks.maximum = 20, which is
more than enough - so does that mean the approach is correct? (Well, I
have CPU x 4 in each machine, so in the case of large data I should divide
by 4 to determine the smallest figure for
mapred.tasktracker.map.tasks.maximum.)

> > When I run the MR program, I can see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so I assume that everything is
> > going well here - once again, if I'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3.)
>
> This value tells us the number of maps that will run for the job.

OK.

> > There's one thing I'm uncertain about in Hadoop's distribution.
> > Does Maps Total = 8 mean that there are 8 map tasks split among all
> > the data nodes (tasktrackers)?
> > Is there any way I can check whether the tasks are shared among
> > datanodes (where the tasktrackers are working)?
>
> There's no easy way to check this. The task page for every task shows
> the attempts that ran for each task, and where they ran, under the
> 'Machine' column.

Thank you. I see that they're processed on different "Machine" entries, so
I guess it's working correctly :)

> > When I clicked on each link under the Task Id, I can see "Input Split
> > Locations" stated under each task's details. If the inputs are split
> > between data nodes, does that mean that everything is working well?
>
> I think this is just the location of the splits, including the replicas.
> What you could check is whether enough data-local maps ran - which means
> that the tasks mostly got their inputs from datanodes running on the
> same machine as themselves. This is given by the counter "Data-local map
> tasks" on the job UI page.

There are two cases under the Job UI:

           Counter                 Map   Reduce   Total
           -------------------------------------------
Case (1)   Launched map tasks       0      0       4
           Data-local map tasks     0      0       4
Case (2)   Launched map tasks       0      0       2
           Data-local map tasks     0      0       1

Hmm.. I don't quite understand this. In case (2), does it mean the two map
tasks are actually reading data from the same datanode?

But anyway, is this monitoring needed for tuning performance?

Thank you.
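For what it's worth, a small sketch of how those two counters can be read together - `data_local_fraction` is a made-up helper name, not part of Hadoop; it just computes the locality ratio the counters above let you estimate:

```python
def data_local_fraction(launched_maps: int, data_local_maps: int) -> float:
    """Fraction of launched map tasks that read their input split from a
    datanode on the same machine (i.e. were data-local)."""
    if launched_maps == 0:
        return 0.0
    return data_local_maps / launched_maps

# Case (1) from the thread: all 4 launched maps were data-local.
print(data_local_fraction(4, 4))  # -> 1.0

# Case (2): only 1 of the 2 launched maps was data-local; the other map
# had to pull its input over the network from a remote datanode.
print(data_local_fraction(2, 1))  # -> 0.5
```

So case (2) would mean one map was not data-local (it read its block over the network), rather than two maps sharing one datanode. A persistently low ratio is worth monitoring for tuning, since non-local reads add network transfer time to every affected map.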