From: Sugandha Naolekar <sugandha.n87@gmail.com>
Date: Thu, 27 Feb 2014 09:51:13 +0530
To: core-user@hadoop.apache.org
Subject: Re: Mappers vs. Map tasks

Joao Paulo,

Your suggestion is appreciated. Although, on a side note, what is more
tedious: writing a custom InputFormat, or changing the code which is
generating the input splits?

--
Thanks & Regards,
Sugandha Naolekar


On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny wrote:

> If I understood your problem correctly, you have one huge JSON, which is
> basically a JSONArray, and you want to process one JSONObject of the
> array at a time.
>
> I have faced the same issue some time ago, and instead of changing the
> input format, I changed the code that was generating this input to
> generate lots of JSONObjects, one per line. Hence, using the default
> TextInputFormat, the map function was getting called with one entire
> JSONObject at a time.
>
> A JSONArray is not good as a mapreduce input, since it has a first [ and
> a last ] and commas between the JSONs of the array. The array can instead
> be represented implicitly by the file that the JSONs belong to.
>
> Of course, this approach works only if you can modify what is generating
> the input you're talking about.
>
>
> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq:
>
>> In that case you have to convert your JSON data into seq files first and
>> then do the processing.
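[Editor's note] João Paulo's preprocessing idea — flatten the JSONArray into one JSONObject per line so the stock TextInputFormat works unchanged — can be sketched in plain Java. The class name and the brace-depth scanning approach are illustrative (not from the thread), and the sketch assumes top-level array elements are objects:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the preprocessing step: split a JSONArray string into one
// JSON object per line, by tracking brace depth outside string literals.
// A real pipeline would stream from/to files; this works on a String.
public class JsonArrayFlattener {
    public static List<String> flatten(String jsonArray) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < jsonArray.length(); i++) {
            char c = jsonArray.charAt(i);
            // Toggle string state on unescaped quotes so braces inside
            // string values are not counted.
            if (c == '"' && (i == 0 || jsonArray.charAt(i - 1) != '\\')) {
                inString = !inString;
            }
            if (!inString) {
                if (c == '{') depth++;
                if (c == '}') {
                    depth--;
                    if (depth == 0) {          // top-level object closed
                        current.append(c);
                        lines.add(current.toString());
                        current.setLength(0);
                        continue;
                    }
                }
            }
            if (depth > 0) current.append(c); // skip [, ], commas at depth 0
        }
        return lines;
    }
}
```

Each returned string is then written out on its own line, after which every map() call receives exactly one object as its value.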
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar wrote:
>>
>>> Can I use SequenceFileInputFormat to do the same?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq wrote:
>>>
>>>> Since there is no OOTB feature that allows this, you have to write
>>>> your custom InputFormat to handle JSON data. Alternatively, you could
>>>> make use of Pig or Hive, as they have built-in JSON support.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju wrote:
>>>>
>>>>> One simple way is to remove the newline characters, so that the
>>>>> default record reader and the default way the block is read will take
>>>>> care of the input splits, and the JSON will not get affected by the
>>>>> removal of the NL characters.
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar wrote:
>>>>>
>>>>>> Ok, got it. Now I have a single file which is 129 MB. Thus, it will
>>>>>> be split into two blocks. Now, since my file is a JSON file, I cannot
>>>>>> use TextInputFormat, as every input split (logical) will be a single
>>>>>> line of the JSON file, which I don't want. Thus, in this case, can I
>>>>>> write a custom input format and a custom record reader, so that every
>>>>>> input split (logical) will have only that part of the data which I
>>>>>> require?
>>>>>>
>>>>>> For e.g.:
>>>>>>
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>> 1458938.970140 ] ] } }
>>>>>> ,
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>> 1458829.592709 ] ] } }
>>>>>>
>>>>>> *I want every input split here to consist of one entire Feature
>>>>>> object, so that I can process it accordingly by giving relevant K,V
>>>>>> pairs to the map function.*
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq wrote:
>>>>>>
>>>>>>> Hi Sugandha,
>>>>>>>
>>>>>>> Please find my comments embedded below:
>>>>>>>
>>>>>>> Q: No. of mappers are decided as Total_File_Size / Max. Block Size.
>>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>>> will be invoked. Right?
>>>>>>>
>>>>>>> A: This is true (but not always). The basic criterion behind map
>>>>>>> creation is the logic inside the *getSplits* method of the
>>>>>>> *InputFormat* being used in your MR job. It is the behavior of
>>>>>>> *file-based InputFormats*, typically sub-classes of *FileInputFormat*,
>>>>>>> to split the input data into splits based on the total size, in bytes,
>>>>>>> of the input files. See *this* for more details. And yes, if the file
>>>>>>> is smaller than the block size, then only one mapper will be created.
>>>>>>>
>>>>>>> Q: If yes, it means the map() will be called only once. Right? In
>>>>>>> this case, if there are two datanodes with a replication factor of 1,
>>>>>>> only one datanode (mapper machine) will perform the task. Right?
>>>>>>>
>>>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>>>> HDFS's block. Both are different (they may overlap, though, as in the
>>>>>>> case of FileInputFormat). HDFS blocks are a physical partitioning of
>>>>>>> your data, while an InputSplit is just a logical partitioning. If you
>>>>>>> have a file which is smaller than the HDFS block size, then only one
>>>>>>> split will be created, hence only one mapper will be called. And this
>>>>>>> will happen on the node where this file resides.
>>>>>>>
>>>>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>
>>>>>>> A: map() doesn't get called by anybody. It rather gets run on the
>>>>>>> node where the chunk of data to be processed resides. A slave node
>>>>>>> can run multiple mappers based on the availability of CPU slots.
>>>>>>>
>>>>>>> Q: One more thing to ask: no. of blocks = no. of mappers. Thus, the
>>>>>>> map() function will be called that many times, right?
>>>>>>>
>>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called
>>>>>>> only once per split, on the node where that split is present.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar wrote:
>>>>>>>
>>>>>>>> Hi Bertrand,
>>>>>>>>
>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>> only true when you set isSplittable() as false, or when your input
>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>>
>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux wrote:
>>>>>>>>
>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource:
>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>
>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>> really make sense to talk about a number of mappers.
>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>>> instantiated.
>>>>>>>>>
>>>>>>>>> Always:
>>>>>>>>> number of input splits = number of map tasks
>>>>>>>>>
>>>>>>>>> And generally:
>>>>>>>>> number of HDFS blocks = number of input splits
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Bertrand
>>>>>>>>>
>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>> writing a mail.
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte wrote:
>>>>>>>>>
>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are
>>>>>>>>>> more map tasks than slots, there will obviously be multiple
>>>>>>>>>> rounds of mapping.
>>>>>>>>>>
>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>> is typically 64 MB and can contain a multitude of records;
>>>>>>>>>> therefore a map task = running the map() function on all records
>>>>>>>>>> in the block.
>>>>>>>>>>
>>>>>>>>>> Number of blocks = number of map tasks (not mappers).
>>>>>>>>>>
>>>>>>>>>> Furthermore, you have to make a distinction between the two
>>>>>>>>>> layers. You have a layer for computation, which consists of a
>>>>>>>>>> jobtracker and a set of tasktrackers. The other layer is
>>>>>>>>>> responsible for storage: HDFS has a namenode and a set of
>>>>>>>>>> datanodes.
>>>>>>>>>>
>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated
>>>>>>>>>> with this block will likely be executed on one of those physical
>>>>>>>>>> nodes, by tasktracker 1, 2 or 3. But this is not necessary;
>>>>>>>>>> things can be rearranged.
>>>>>>>>>>
>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>
>>>>>>>>>> Regards, Dieter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar:
>>>>>>>>>>
>>>>>>>>>>> One more thing to ask: no. of blocks = no. of mappers. Thus,
>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> As per the various articles I have read to date, the file(s)
>>>>>>>>>>>> are split into chunks/blocks. On the same note, I would like
>>>>>>>>>>>> to ask a few things:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. No. of mappers are decided as: Total_File_Size / Max. Block
>>>>>>>>>>>> Size. Thus, if the file is smaller than the block size, only
>>>>>>>>>>>> one mapper will be invoked. Right?
>>>>>>>>>>>> 2. If yes, it means the map() will be called only once. Right?
>>>>>>>>>>>> In this case, if there are two datanodes with a replication
>>>>>>>>>>>> factor of 1, only one datanode (mapper machine) will perform
>>>>>>>>>>>> the task. Right?
>>>>>>>>>>>> 3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>>>> right? If the no. of mappers is more than the no. of slaves,
>>>>>>>>>>>> what happens?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar