From: Sugandha Naolekar <sugandha.n87@gmail.com>
Date: Thu, 27 Feb 2014 09:51:13 +0530
To: core-user@hadoop.apache.org
Subject: Re: Mappers vs. Map tasks

Joao Paulo,

Your suggestion is appreciated. Although, on a side note, what is more
tedious: writing a custom InputFormat, or changing the code which is
generating the input splits?

--
Thanks & Regards,
Sugandha Naolekar


On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny wrote:

> If I understood your problem correctly, you have one huge JSON, which is
> basically a JSONArray, and you want to process one JSONObject of the
> array at a time.
>
> I have faced the same issue some time ago, and instead of changing the
> input format, I changed the code that was generating this input to
> generate lots of JSONObjects, one per line. Hence, using the default
> TextInputFormat, the map function was getting called with one entire
> JSONObject at a time.
>
> A JSONArray is not good as a mapreduce input, since it has a first [ and
> a last ] and commas between the JSONs of the array. The array can instead
> be represented implicitly by the file that the JSONs belong to.
>
> Of course, this approach works only if you can modify what is generating
> the input you're talking about.
>
>
> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq:
>
>> In that case you have to convert your JSON data into seq files first and
>> then do the processing.
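[Editor's note] João Paulo's preprocessing idea — flatten the JSONArray into one JSONObject per line so the stock TextInputFormat works unchanged — can be sketched in plain Java. The class name and the brace-depth scanning approach are illustrative (not from the thread), and the sketch assumes top-level array elements are objects:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the preprocessing step: split a JSONArray string into one
// JSON object per line, by tracking brace depth outside string literals.
// A real pipeline would stream from/to files; this works on a String.
public class JsonArrayFlattener {
    public static List<String> flatten(String jsonArray) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        boolean inString = false;
        for (int i = 0; i < jsonArray.length(); i++) {
            char c = jsonArray.charAt(i);
            // Toggle string state on unescaped quotes so braces inside
            // string values are not counted.
            if (c == '"' && (i == 0 || jsonArray.charAt(i - 1) != '\\')) {
                inString = !inString;
            }
            if (!inString) {
                if (c == '{') depth++;
                if (c == '}') {
                    depth--;
                    if (depth == 0) {          // top-level object closed
                        current.append(c);
                        lines.add(current.toString());
                        current.setLength(0);
                        continue;
                    }
                }
            }
            if (depth > 0) current.append(c); // skip [, ], commas at depth 0
        }
        return lines;
    }
}
```

Each returned string is then written out on its own line, after which every map() call receives exactly one object as its value.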
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar wrote:
>>
>>> Can I use SequenceFileInputFormat to do the same?
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq wrote:
>>>
>>>> Since there is no OOTB feature that allows this, you have to write
>>>> your custom InputFormat to handle JSON data. Alternatively, you could
>>>> make use of Pig or Hive, as they have built-in JSON support.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju wrote:
>>>>
>>>>> One simple way is to remove the newline characters, so that the
>>>>> default record reader and the default way the block is read will take
>>>>> care of the input splits, and the JSON will not get affected by the
>>>>> removal of the NL characters.
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar wrote:
>>>>>
>>>>>> Ok, got it. Now I have a single file which is 129 MB. Thus, it will
>>>>>> be split into two blocks. Now, since my file is a JSON file, I cannot
>>>>>> use TextInputFormat, as every input split (logical) will be a single
>>>>>> line of the JSON file, which I don't want. Thus, in this case, can I
>>>>>> write a custom input format and a custom record reader, so that every
>>>>>> input split (logical) will have only that part of the data which I
>>>>>> require?
>>>>>>
>>>>>> For e.g.:
>>>>>>
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>> 1458938.970140 ] ] } }
>>>>>> ,
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>> 1458829.592709 ] ] } }
>>>>>>
>>>>>> *I want every input split here to consist of one entire Feature
>>>>>> object, so that I can process it accordingly by giving relevant K,V
>>>>>> pairs to the map function.*
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq wrote:
>>>>>>
>>>>>>> Hi Sugandha,
>>>>>>>
>>>>>>> Please find my comments embedded below:
>>>>>>>
>>>>>>> Q: No. of mappers are decided as Total_File_Size / Max. Block Size.
>>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>>> will be invoked. Right?
>>>>>>>
>>>>>>> A: This is true (but not always). The basic criterion behind map
>>>>>>> creation is the logic inside the *getSplits* method of the
>>>>>>> *InputFormat* being used in your MR job. It is the behavior of
>>>>>>> *file-based InputFormats*, typically sub-classes of *FileInputFormat*,
>>>>>>> to split the input data into splits based on the total size, in bytes,
>>>>>>> of the input files. See *this* for more details. And yes, if the file
>>>>>>> is smaller than the block size, then only one mapper will be created.
>>>>>>>
>>>>>>> Q: If yes, it means the map() will be called only once. Right? In
>>>>>>> this case, if there are two datanodes with a replication factor of 1,
>>>>>>> only one datanode (mapper machine) will perform the task. Right?
>>>>>>>
>>>>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>>>>> HDFS's block. Both are different (they may overlap, though, as in the
>>>>>>> case of FileInputFormat). HDFS blocks are a physical partitioning of
>>>>>>> your data, while an InputSplit is just a logical partitioning. If you
>>>>>>> have a file which is smaller than the HDFS block size, then only one
>>>>>>> split will be created, hence only one mapper will be called. And this
>>>>>>> will happen on the node where this file resides.
>>>>>>>
>>>>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>
>>>>>>> A: map() doesn't get called by anybody. It rather gets run on the
>>>>>>> node where the chunk of data to be processed resides. A slave node
>>>>>>> can run multiple mappers based on the availability of CPU slots.
>>>>>>>
>>>>>>> Q: One more thing to ask: no. of blocks = no. of mappers. Thus, the
>>>>>>> map() function will be called that many times, right?
>>>>>>>
>>>>>>> A: No. of blocks = no. of splits = no. of mappers. A map is called
>>>>>>> only once per split, on the node where that split is present.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar wrote:
>>>>>>>
>>>>>>>> Hi Bertrand,
>>>>>>>>
>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>> only true when you set isSplittable() as false, or when your input
>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>>
>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux wrote:
>>>>>>>>
>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource:
>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>
>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>> really make sense to talk about a number of mappers.
>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>>>> instantiated.
>>>>>>>>>
>>>>>>>>> Always:
>>>>>>>>> number of input splits = number of map tasks
>>>>>>>>>
>>>>>>>>> And generally:
>>>>>>>>> number of HDFS blocks = number of input splits
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Bertrand
>>>>>>>>>
>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>> writing a mail.
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte wrote:
>>>>>>>>>
>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are
>>>>>>>>>> more map tasks than slots, there will obviously be multiple
>>>>>>>>>> rounds of mapping.
>>>>>>>>>>
>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>> is typically 64 MB and can contain a multitude of records;
>>>>>>>>>> therefore a map task = running the map() function on all records
>>>>>>>>>> in the block.
>>>>>>>>>>
>>>>>>>>>> Number of blocks = number of map tasks (not mappers).
>>>>>>>>>>
>>>>>>>>>> Furthermore, you have to make a distinction between the two
>>>>>>>>>> layers. You have a layer for computation, which consists of a
>>>>>>>>>> jobtracker and a set of tasktrackers. The other layer is
>>>>>>>>>> responsible for storage: HDFS has a namenode and a set of
>>>>>>>>>> datanodes.
>>>>>>>>>>
>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated
>>>>>>>>>> with this block will likely be executed on one of those physical
>>>>>>>>>> nodes, by tasktracker 1, 2 or 3. But this is not necessary;
>>>>>>>>>> things can be rearranged.
>>>>>>>>>>
>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>
>>>>>>>>>> Regards, Dieter
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar:
>>>>>>>>>>
>>>>>>>>>>> One more thing to ask: no. of blocks = no. of mappers. Thus,
>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> As per the various articles I have read to date, the file(s)
>>>>>>>>>>>> are split into chunks/blocks. On the same note, I would like
>>>>>>>>>>>> to ask a few things:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. No. of mappers are decided as: Total_File_Size / Max. Block
>>>>>>>>>>>> Size. Thus, if the file is smaller than the block size, only
>>>>>>>>>>>> one mapper will be invoked. Right?
>>>>>>>>>>>> 2. If yes, it means the map() will be called only once. Right?
>>>>>>>>>>>> In this case, if there are two datanodes with a replication
>>>>>>>>>>>> factor of 1, only one datanode (mapper machine) will perform
>>>>>>>>>>>> the task. Right?
>>>>>>>>>>>> 3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>>>> right? If the no. of mappers is more than the no. of slaves,
>>>>>>>>>>>> what happens?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar