Subject: Re: Mappers vs. Map tasks
From: Dieter De Witte
To: user@hadoop.apache.org
Date: Tue, 25 Feb 2014 08:49:34 +0100

Each node has a tasktracker with a number of map slots. A map slot hosts a mapper, and a mapper executes map tasks. If there are more map tasks than slots, there will obviously be multiple rounds of mapping.

The map() function is called once for each input record. A block is typically 64 MB and can contain many records, so a map task runs map() on every record in its block.

Number of blocks = number of map tasks (not mappers).

Furthermore, you have to distinguish between the two layers. One layer handles computation and consists of a jobtracker and a set of tasktrackers. The other layer is responsible for storage: HDFS has a namenode and a set of datanodes.

In MapReduce the code is executed where the data is. So if a block is stored on datanodes 1, 2 and 3, then the map task associated with that block will likely be executed on one of those physical nodes, by tasktracker 1, 2 or 3. But this is not strictly necessary; things can be rearranged.
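To make the "one map task per block, one map() call per record" point concrete: a 200 MB file with a 64 MB block size is stored as ceil(200 / 64) = 4 blocks, so the job gets 4 map tasks, and inside each task the framework calls map() once per record of that block's split. Below is a minimal mapper sketch against the org.apache.hadoop.mapreduce API, assuming the default TextInputFormat (one record = one line of text); the class name and the word-counting logic are only illustrative, not something from this thread.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map task is created per input split (normally one HDFS block).
// The framework then invokes map() once for every record in that split.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // 'offset' is the byte offset of this line in the file,
        // 'line' is the single record handed to us by the record reader.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // one (word, 1) pair per token
        }
    }
}

So the number of map() invocations equals the number of records in the input, while the number of map tasks equals the number of splits/blocks; the tasktracker's map slots only bound how many of those tasks run at the same time.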
Hopefully this gives you a little more insight.

Regards,
Dieter


2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:

> One more thing to ask: No. of blocks = no. of mappers. Thus, the map()
> function will be called that many times, right?
>
> --
> Thanks & Regards,
> Sugandha Naolekar
>
>
> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar
> <sugandha.n87@gmail.com> wrote:
>
>> Hello,
>>
>> As per the various articles I went through till date, the file(s) are
>> split into chunks/blocks. On the same note, I would like to ask a few
>> things:
>>
>> 1. The no. of mappers is decided as Total_File_Size / Max. Block Size.
>> Thus, if a file is smaller than the block size, only one mapper will be
>> invoked. Right?
>> 2. If yes, that means map() will be called only once. Right? In that
>> case, if there are two datanodes with a replication factor of 1, only
>> one datanode (the mapper machine) will perform the task. Right?
>> 3. Is the map() function called by all the datanodes/slaves? If the
>> no. of mappers is greater than the no. of slaves, what happens?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar