hadoop-common-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Sugandha Naolekar <sugandha....@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Wed, 26 Feb 2014 04:31:22 GMT
Ok. Got it. Now I have a single file which is of 129MB. Thus, it will be
split into two blocks. Now, since my file is a json file, I cannot use
textinputformat. As, every input split(logical) will be a single line of
the json file. Which I dont want. Thus, in this case, can I write a custom
input format and a custom record reader so that, every input split(logical)
will have only that part of data which I require.

For. e.g:

{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
"OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
"REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
"F" }, "geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
,
{ "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
"CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
"OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
"REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
"START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
"REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }

*I want here the every input split to consist of entire type data and thus,
I can process it accordingly by giving relevant k,V pairs to the map
function.*


--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Hi Sugandha,
>
> Please find my comments embedded below :
>
>                   No. of mappers are decided as: Total_File_Size/Max.
> Block Size. Thus, if the file is smaller than the block size, only one
> mapper will be                               invoked. Right?
>                   This is true(but not always). The basic criteria behind
> map creation is the logic inside *getSplits* method of *InputFormat*being used in your
                    MR job. It is the behavior of *file
> based InputFormats*, typically sub-classes of *FileInputFormat*, to split
> the input data into splits based                     on the total size, in
> bytes, of the input files. See *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>for
more details. And yes, if the file is smaller than the block size then
> only 1 mapper will                     be created.
>
>                   If yes, it means, the map() will be called only once.
> Right? In this case, if there are two datanodes with a replication factor
> as 1: only one                               datanode(mapper machine) will
> perform the task. Right?
>                   A mapper is called for each split. Don't get confused
> with the MR's split and HDFS's block. Both are different(They may overlap
> though, as in                     case of FileInputFormat). HDFS blocks are
> physical partitioning of your data, while an InputSplit is just a logical
> partitioning. If you have a                       file which is smaller
> than the HDFS blocksize then only one split will be created, hence only 1
> mapper will be called. And this will happen on                     the node
> where this file resides.
>
>                   The map() function is called by all the datanodes/slaves
> right? If the no. of mappers are more than the no. of slaves, what happens?
>                   map() doesn't get called by anybody. It rather gets
> created on the node where the chunk of data to be processed resides. A
> slave node can run                       multiple mappers based on the
> availability of CPU slots.
>
>                  One more thing to ask: No. of blocks = no. of mappers.
> Thus, those many no. of times the map() function will be called right?
>                  No. of blocks = no. of splits = no. of mappers. A map is
> called only once per split per node where that split is present.
>
> HTH
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <sugandha.n87@gmail.com
> > wrote:
>
>> Hi Bertrand,
>>
>> As you said, no. of HDFS blocks =  no. of input splits. But this is only
>> true when you set isSplittable() as false or when your input file size is
>> less than the block size. Also, when it comes to text files, the default
>> textinputformat considers each line as one input split which can be then
>> read by RecordReader in K,V format.
>>
>> Please correct me if I don't make sense.
>>
>> --
>> Thanks & Regards,
>> Sugandha Naolekar
>>
>>
>>
>>
>>
>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com>wrote:
>>
>>> The wiki (or Hadoop The Definitive Guide) are good ressources.
>>>
>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>
>>> Mapper is the name of the abstract class/interface. It does not really
>>> make sense to talk about number of mappers.
>>> A task is a jvm that can be launched only if there is a free slot ie for
>>> a given slot, at a given time, there will be at maximum only a single task.
>>> During the task, the configured Mapper will be instantiated.
>>>
>>> Always :
>>> Number of input splits = no. of map tasks
>>>
>>> And generally :
>>> number of hdfs blocks = number of input splits
>>>
>>> Regards
>>>
>>> Bertrand
>>>
>>> PS : I don't know if it is only my client, but avoid red when writting a
>>> mail.
>>>
>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com>wrote:
>>>
>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>> hosts as mapper. A mapper executes map tasks. If there are more map tasks
>>>> than slots obviously there will be multiple rounds of mapping.
>>>>
>>>> The map function is called once for each input record. A block is
>>>> typically 64MB and can contain a multitude of record, therefore a map task
>>>> = run the map() function on all records in the block.
>>>>
>>>> Number of blocks = no. of map tasks (not mappers)
>>>>
>>>> Furthermore you have to make a distinction between the two layers. You
>>>> have a layer for computations which consists of a jobtracker and a set of
>>>> tasktrackers. The other layer is responsible for storage. The HDFS has a
>>>> namenode and a set of datanodes.
>>>>
>>>> In mapreduce the code is executed where the data is. So if a block is
>>>> in datanode 1, 2 and 3, then the map task associated with this block will
>>>> likely be executed on one of those physical nodes, by tasktracker 1, 2 or
>>>> 3. But this is not necessary, thing can be rearranged.
>>>>
>>>> Hopefully this gives you a little more insigth.
>>>>
>>>> Regards, Dieter
>>>>
>>>>
>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>>
>>>>  One more thing to ask: No. of blocks = no. of mappers. Thus, those
>>>>> many no. of times the map() function will be called right?
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> As per the various articles I went through till date, the File(s)
are
>>>>>> split in chunks/blocks. On the same note, would like to ask few things:
>>>>>>
>>>>>>
>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>    Size. Thus, if the file is smaller than the block size, only one
mapper
>>>>>>    will be invoked. Right?
>>>>>>    2. If yes, it means, the map() will be called only once. Right?
>>>>>>    In this case, if there are two datanodes with a replication factor
as 1:
>>>>>>    only one datanode(mapper machine) will perform the task. Right?
>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>    right? If the no. of mappers are more than the no. of slaves,
what happens?
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Mime
View raw message