hadoop-user mailing list archives

From: Mohammad Tariq <donta...@gmail.com>
Subject: Re: Mappers vs. Map tasks
Date: Thu, 27 Feb 2014 05:38:23 GMT
The code that generates the splits is part of the InputFormat; specifically,
it is the getSplits method:
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/InputFormat.html#getSplits(org.apache.hadoop.mapred.JobConf,%20int)
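
If the goal is just to keep a whole file in a single split, you don't even
have to rewrite getSplits; overriding isSplitable is enough. A minimal sketch
against the new API (the class name is only a placeholder):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Keeps each input file in one split, so the framework's default
// getSplits() yields one split (and hence one map task) per file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}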

Warm Regards,
Tariq
cloudfront.blogspot.com


On Thu, Feb 27, 2014 at 9:58 AM, Rajesh Nagaraju <rajeshnagaraju@gmail.com> wrote:

> Hi
>
> I have worked on a similar project. I had an M/R job for preprocessing that
> would remove the newline characters and ensure that each line holds exactly
> one JSON object. The subsequent MRs then do the processing. You can define
> a workflow to chain the MRs together.
>
> Thanks and Regards
> Rajesh Nagaraju
>
>
> On Thu, Feb 27, 2014 at 9:51 AM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
>
>> João Paulo,
>>
>> Your suggestion is appreciated. On a side note, though: which is more
>> tedious, writing a custom InputFormat or changing the code that generates
>> the input?
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jpforny@gmail.com> wrote:
>>
>>> If I understood your problem correctly, you have one huge JSON, which is
>>> basically a JSONArray, and you want to process one JSONObject of the array
>>> at a time.
>>>
>>> I faced the same issue some time ago, and instead of changing the input
>>> format, I changed the code that was generating this input to emit lots of
>>> JSONObjects, one per line. Hence, using the default TextInputFormat, the
>>> map function gets called with exactly one complete JSONObject at a time.
>>>
>>> A JSONArray is not good as a mapreduce input since it has a leading [, a
>>> trailing ], and commas between the JSONs of the array. The array can
>>> instead be represented by the file itself, to which the JSONObjects belong.
>>>
>>> Of course, this approach works only if you can modify what is generating
>>> the input you're talking about.
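>>>
>>> For reference, the conversion itself can stay tiny. A rough sketch of that
>>> preprocessing step, assuming Jackson is on the classpath (the file names
>>> here are made up):
>>>
>>> import java.io.BufferedWriter;
>>> import java.io.File;
>>> import java.io.FileWriter;
>>> import java.io.IOException;
>>> import com.fasterxml.jackson.core.JsonParser;
>>> import com.fasterxml.jackson.core.JsonToken;
>>> import com.fasterxml.jackson.databind.JsonNode;
>>> import com.fasterxml.jackson.databind.ObjectMapper;
>>>
>>> // Streams through a top-level JSONArray and writes each element as one
>>> // compact JSON object per line, ready for TextInputFormat.
>>> public class FlattenJsonArray {
>>>     public static void main(String[] args) throws IOException {
>>>         ObjectMapper mapper = new ObjectMapper();
>>>         try (JsonParser parser = mapper.getFactory()
>>>                 .createParser(new File("features.json"));
>>>              BufferedWriter out = new BufferedWriter(
>>>                 new FileWriter("features-flat.json"))) {
>>>             if (parser.nextToken() != JsonToken.START_ARRAY) {
>>>                 throw new IOException("expected a top-level JSON array");
>>>             }
>>>             while (parser.nextToken() == JsonToken.START_OBJECT) {
>>>                 JsonNode obj = mapper.readTree(parser); // one array element
>>>                 out.write(obj.toString()); // toString() emits a single line
>>>                 out.newLine();
>>>             }
>>>         }
>>>     }
>>> }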
>>>
>>>
>>> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <dontariq@gmail.com>:
>>>
>>>> In that case you have to convert your JSON data into SequenceFiles first
>>>> and then do the processing.
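>>>>
>>>> A rough sketch of that conversion, assuming the JSON has already been
>>>> flattened to one object per line (the paths and class name are only
>>>> illustrative):
>>>>
>>>> import java.io.BufferedReader;
>>>> import java.io.FileReader;
>>>> import org.apache.hadoop.conf.Configuration;
>>>> import org.apache.hadoop.fs.Path;
>>>> import org.apache.hadoop.io.NullWritable;
>>>> import org.apache.hadoop.io.SequenceFile;
>>>> import org.apache.hadoop.io.Text;
>>>>
>>>> // Packs one-JSON-object-per-line input into a SequenceFile, so each
>>>> // record stays intact no matter how the file is split later.
>>>> public class JsonToSequenceFile {
>>>>     public static void main(String[] args) throws Exception {
>>>>         Configuration conf = new Configuration();
>>>>         try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
>>>>                 SequenceFile.Writer.file(new Path("features.seq")),
>>>>                 SequenceFile.Writer.keyClass(NullWritable.class),
>>>>                 SequenceFile.Writer.valueClass(Text.class));
>>>>              BufferedReader in = new BufferedReader(
>>>>                 new FileReader("features-flat.json"))) {
>>>>             String line;
>>>>             while ((line = in.readLine()) != null) {
>>>>                 writer.append(NullWritable.get(), new Text(line));
>>>>             }
>>>>         }
>>>>     }
>>>> }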
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Can I use SequenceFileInputFormat to do the same?
>>>>>
>>>>>  --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>
>>>>>> Since there is no OOTB feature that allows this, you have to write your
>>>>>> own custom InputFormat to handle JSON data. Alternatively, you could
>>>>>> make use of Pig or Hive, as they have built-in JSON support.
>>>>>>
>>>>>> Warm Regards,
>>>>>> Tariq
>>>>>> cloudfront.blogspot.com
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <rajeshnagaraju@gmail.com> wrote:
>>>>>>
>>>>>>> One simple way is to remove the newline characters, so that the default
>>>>>>> record reader and the default way the block is read will take care of
>>>>>>> the input splits, and the JSON will not be affected by the removal of
>>>>>>> the newline characters.
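>>>>>>>
>>>>>>> Once each line holds one complete JSON object, a plain mapper can parse
>>>>>>> it directly. A rough sketch, assuming Jackson on the classpath and
>>>>>>> picking OSM_ID -> LENGTH purely as an illustration:
>>>>>>>
>>>>>>> import java.io.IOException;
>>>>>>> import com.fasterxml.jackson.databind.JsonNode;
>>>>>>> import com.fasterxml.jackson.databind.ObjectMapper;
>>>>>>> import org.apache.hadoop.io.DoubleWritable;
>>>>>>> import org.apache.hadoop.io.LongWritable;
>>>>>>> import org.apache.hadoop.io.Text;
>>>>>>> import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>
>>>>>>> // With TextInputFormat, map() runs once per line, i.e. once per
>>>>>>> // Feature after the newline removal described above.
>>>>>>> public class FeatureMapper
>>>>>>>         extends Mapper<LongWritable, Text, Text, DoubleWritable> {
>>>>>>>     private final ObjectMapper json = new ObjectMapper();
>>>>>>>
>>>>>>>     @Override
>>>>>>>     protected void map(LongWritable offset, Text line, Context ctx)
>>>>>>>             throws IOException, InterruptedException {
>>>>>>>         JsonNode props = json.readTree(line.toString()).get("properties");
>>>>>>>         ctx.write(new Text(props.get("OSM_ID").asText()),
>>>>>>>                   new DoubleWritable(props.get("LENGTH").asDouble()));
>>>>>>>     }
>>>>>>> }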
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ok, got it. Now I have a single file which is 129 MB, so it will be
>>>>>>>> split into two blocks. Since my file is a JSON file, I cannot use
>>>>>>>> TextInputFormat, as every (logical) input split would then be a single
>>>>>>>> line of the JSON file, which I don't want. In this case, can I write a
>>>>>>>> custom InputFormat and a custom RecordReader so that every (logical)
>>>>>>>> input split contains only the part of the data I require?
>>>>>>>>
>>>>>>>> For example:
>>>>>>>>
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1":
77.552595,
>>>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000,
"KM":
>>>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279,
"SPEED_IN_K":
>>>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>>>> 1458938.970140 ] ] } }
>>>>>>>> ,
>>>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1":
77.538462,
>>>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000,
"KM":
>>>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535,
"SPEED_IN_K":
>>>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>>>> 1458829.592709 ] ] } }
>>>>>>>>
>>>>>>>> *I want every input split here to consist of one entire Feature's
>>>>>>>> data, so that I can process it accordingly by giving relevant K,V
>>>>>>>> pairs to the map function.*
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Sugandha,
>>>>>>>>>
>>>>>>>>> Please find my comments embedded below:
>>>>>>>>>
>>>>>>>>>> No. of mappers are decided as: Total_File_Size/Max. Block Size.
>>>>>>>>>> Thus, if the file is smaller than the block size, only one mapper
>>>>>>>>>> will be invoked. Right?
>>>>>>>>> This is true (but not always). The basic criterion behind map creation
>>>>>>>>> is the logic inside the *getSplits* method of the *InputFormat* being
>>>>>>>>> used in your MR job. It is the behavior of *file based InputFormats*,
>>>>>>>>> typically sub-classes of *FileInputFormat*, to split the input data
>>>>>>>>> into splits based on the total size, in bytes, of the input files. See
>>>>>>>>> *this* <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>>>> for more details. And yes, if the file is smaller than the block size
>>>>>>>>> then only 1 mapper will be created.
>>>>>>>>>
>>>>>>>>>> If yes, it means, the map() will be called only once. Right? In this
>>>>>>>>>> case, if there are two datanodes with a replication factor as 1: only
>>>>>>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>>>>> A mapper is called for each split. Don't confuse MR's split with HDFS's
>>>>>>>>> block; they are different (they may overlap though, as in the case of
>>>>>>>>> FileInputFormat). HDFS blocks are a physical partitioning of your data,
>>>>>>>>> while an InputSplit is just a logical partitioning. If you have a file
>>>>>>>>> which is smaller than the HDFS block size then only one split will be
>>>>>>>>> created, hence only 1 mapper will be called. And this will happen on
>>>>>>>>> the node where this file resides.
>>>>>>>>>
>>>>>>>>>> The map() function is called by all the datanodes/slaves right? If
>>>>>>>>>> the no. of mappers is more than the no. of slaves, what happens?
>>>>>>>>> map() doesn't get called by anybody. The mapper rather gets created on
>>>>>>>>> the node where the chunk of data to be processed resides. A slave node
>>>>>>>>> can run multiple mappers based on the availability of CPU slots.
>>>>>>>>>
>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>> No. of blocks = no. of splits = no. of mappers. A map is called only
>>>>>>>>> once per split, on a node where that split is present.
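>>>>>>>>>
>>>>>>>>> And if you ever need more map tasks than the block count gives you,
>>>>>>>>> the maximum split size can be capped per job. A small sketch against
>>>>>>>>> the new API (the 64 MB figure is arbitrary):
>>>>>>>>>
>>>>>>>>> import org.apache.hadoop.conf.Configuration;
>>>>>>>>> import org.apache.hadoop.fs.Path;
>>>>>>>>> import org.apache.hadoop.mapreduce.Job;
>>>>>>>>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>>>>>>>>>
>>>>>>>>> public class SplitSizeDemo {
>>>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>>>         Job job = Job.getInstance(new Configuration(), "json-processing");
>>>>>>>>>         // Cap each split at 64 MB: a 129 MB file then yields 3 splits
>>>>>>>>>         // (hence 3 map tasks) even when the HDFS block size is larger.
>>>>>>>>>         FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
>>>>>>>>>         FileInputFormat.addInputPath(job, new Path("/data/features"));
>>>>>>>>>     }
>>>>>>>>> }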
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Warm Regards,
>>>>>>>>> Tariq
>>>>>>>>> cloudfront.blogspot.com
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Bertrand,
>>>>>>>>>>
>>>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>>>> only true when you set isSplittable() as false, or when your input
>>>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>>>>
>>>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>>>>>
>>>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>>>
>>>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>>>> i.e. for a given slot, at a given time, there will be at most one
>>>>>>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>>>
>>>>>>>>>>> Always:
>>>>>>>>>>> number of input splits = no. of map tasks
>>>>>>>>>>>
>>>>>>>>>>> And generally:
>>>>>>>>>>> number of HDFS blocks = number of input splits
>>>>>>>>>>>
>>>>>>>>>>> Regards
>>>>>>>>>>>
>>>>>>>>>>> Bertrand
>>>>>>>>>>>
>>>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>>>> writing a mail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>>>> slot hosts a mapper, and a mapper executes map tasks. If there
>>>>>>>>>>>> are more map tasks than slots, there will obviously be multiple
>>>>>>>>>>>> rounds of mapping.
>>>>>>>>>>>>
>>>>>>>>>>>> The map function is called once for each input record. A block
>>>>>>>>>>>> is typically 64MB and can contain a multitude of records;
>>>>>>>>>>>> therefore a map task = running the map() function on all the
>>>>>>>>>>>> records in the block.
>>>>>>>>>>>>
>>>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>>>
>>>>>>>>>>>> Furthermore you have to make a distinction
between the two
>>>>>>>>>>>> layers. You have a layer for computations
which consists of a jobtracker
>>>>>>>>>>>> and a set of tasktrackers. The other layer
is responsible for storage. The
>>>>>>>>>>>> HDFS has a namenode and a set of datanodes.
>>>>>>>>>>>>
>>>>>>>>>>>> In mapreduce the code is executed where the data is. So if a
>>>>>>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated
>>>>>>>>>>>> with this block will likely be executed on one of those physical
>>>>>>>>>>>> nodes, by tasktracker 1, 2 or 3. But this is not guaranteed;
>>>>>>>>>>>> things can be rearranged.
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards, Dieter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus,
>>>>>>>>>>>>> the map() function will be called that many times, right?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As per the various articles I have gone through to date, the
>>>>>>>>>>>>>> file(s) are split into chunks/blocks. On the same note, I would
>>>>>>>>>>>>>> like to ask a few things:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size / Max. Block
>>>>>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only
>>>>>>>>>>>>>>    one mapper will be invoked. Right?
>>>>>>>>>>>>>>    2. If yes, it means the map() will be called only once. Right?
>>>>>>>>>>>>>>    In this case, if there are two datanodes with a replication
>>>>>>>>>>>>>>    factor of 1: only one datanode (mapper machine) will perform
>>>>>>>>>>>>>>    the task. Right?
>>>>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>>>>>>    right? If the no. of mappers is more than the no. of slaves,
>>>>>>>>>>>>>>    what happens?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
