hadoop-mapreduce-user mailing list archives

From Sugandha Naolekar <sugandha....@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Wed, 26 Feb 2014 11:13:31 GMT
Can I use SequenceFileInputFormat to do the same?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Since there is no out-of-the-box (OOTB) feature that allows this, you
> have to write a custom InputFormat to handle JSON data. Alternatively,
> you could make use of Pig or Hive, as they have built-in JSON support.
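>
> For what it's worth, here is a minimal sketch of what such a custom
> InputFormat could look like (the class names are made up; it treats
> each file as a single split, and it ignores the corner case of braces
> inside quoted strings):
>
>     import java.io.IOException;
>     import org.apache.hadoop.fs.FSDataInputStream;
>     import org.apache.hadoop.fs.Path;
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.InputSplit;
>     import org.apache.hadoop.mapreduce.JobContext;
>     import org.apache.hadoop.mapreduce.RecordReader;
>     import org.apache.hadoop.mapreduce.TaskAttemptContext;
>     import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
>     import org.apache.hadoop.mapreduce.lib.input.FileSplit;
>
>     public class JsonObjectInputFormat extends FileInputFormat<LongWritable, Text> {
>
>         @Override
>         protected boolean isSplitable(JobContext context, Path file) {
>             return false;  // one split per file, so no object straddles two splits
>         }
>
>         @Override
>         public RecordReader<LongWritable, Text> createRecordReader(
>                 InputSplit split, TaskAttemptContext ctx) {
>             return new JsonObjectRecordReader();
>         }
>
>         public static class JsonObjectRecordReader
>                 extends RecordReader<LongWritable, Text> {
>             private FSDataInputStream in;
>             private long length;
>             private long recordNum = 0;
>             private final LongWritable key = new LongWritable();
>             private final Text value = new Text();
>
>             @Override
>             public void initialize(InputSplit split, TaskAttemptContext ctx)
>                     throws IOException {
>                 FileSplit fileSplit = (FileSplit) split;
>                 length = fileSplit.getLength();
>                 in = fileSplit.getPath()
>                         .getFileSystem(ctx.getConfiguration())
>                         .open(fileSplit.getPath());
>             }
>
>             @Override
>             public boolean nextKeyValue() throws IOException {
>                 // Accumulate from the next '{' until braces balance again:
>                 // one complete top-level JSON object becomes one record.
>                 StringBuilder sb = new StringBuilder();
>                 int depth = 0, c;
>                 while ((c = in.read()) != -1) {
>                     if (c == '{') depth++;
>                     if (depth > 0) sb.append((char) c);
>                     if (c == '}' && depth > 0 && --depth == 0) {
>                         key.set(recordNum++);
>                         value.set(sb.toString());
>                         return true;
>                     }
>                 }
>                 return false;  // end of file, no more objects
>             }
>
>             @Override public LongWritable getCurrentKey() { return key; }
>             @Override public Text getCurrentValue() { return value; }
>             @Override public float getProgress() throws IOException {
>                 return length == 0 ? 1.0f : Math.min(1.0f, in.getPos() / (float) length);
>             }
>             @Override public void close() throws IOException { in.close(); }
>         }
>     }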
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
> rajeshnagaraju@gmail.com> wrote:
>
>> One simple way is to remove the newline characters, so that the
>> default record reader and the default way the block is read will take
>> care of the input splits; the JSON itself is not affected by the
>> removal of the newline characters.
>>
>>
>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Ok, got it. Now I have a single file which is 129 MB, so it will be
>>> split into two blocks. Since my file is a JSON file, I cannot use
>>> TextInputFormat, because with it every record is a single line of the
>>> JSON file, which I don't want. In this case, can I write a custom
>>> input format and a custom record reader so that every record holds
>>> only the part of the data I require?
>>>
>>> For. e.g:
>>>
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>> ,
>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>
>>> *I want every record here to consist of one entire Feature, so that I
>>> can process it accordingly by passing the relevant K,V pairs to the
>>> map function.*
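>>>
>>> Just to illustrate (a sketch, assuming a custom record reader hands
>>> each complete Feature object to map() as the value, and using Jackson
>>> for the parsing; the choice of output K,V is made up):
>>>
>>>     import java.io.IOException;
>>>     import org.apache.hadoop.io.DoubleWritable;
>>>     import org.apache.hadoop.io.LongWritable;
>>>     import org.apache.hadoop.io.Text;
>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>     import com.fasterxml.jackson.databind.JsonNode;
>>>     import com.fasterxml.jackson.databind.ObjectMapper;
>>>
>>>     public class FeatureMapper
>>>             extends Mapper<LongWritable, Text, Text, DoubleWritable> {
>>>         private final ObjectMapper json = new ObjectMapper();
>>>
>>>         @Override
>>>         protected void map(LongWritable key, Text value, Context ctx)
>>>                 throws IOException, InterruptedException {
>>>             JsonNode props = json.readTree(value.toString()).path("properties");
>>>             // e.g. emit (ROAD_TYPE, LENGTH) so a reducer can total the
>>>             // road length per road type
>>>             ctx.write(new Text(props.path("ROAD_TYPE").asText()),
>>>                       new DoubleWritable(props.path("LENGTH").asDouble()));
>>>         }
>>>     }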
>>>
>>>
>>> --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>>> Hi Sugandha,
>>>>
>>>> Please find my comments embedded below :
>>>>
>>>> Q: No. of mappers are decided as: Total_File_Size / Max. Block Size.
>>>> Thus, if the file is smaller than the block size, only one mapper
>>>> will be invoked. Right?
>>>>
>>>> A: This is true (but not always). The basic criterion behind map
>>>> creation is the logic inside the *getSplits* method of the
>>>> *InputFormat* being used in your MR job. It is the behavior of *file
>>>> based InputFormats*, typically sub-classes of *FileInputFormat*, to
>>>> split the input data into splits based on the total size, in bytes,
>>>> of the input files. See
>>>> http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html
>>>> for more details. And yes, if the file is smaller than the block
>>>> size then only one mapper will be created.
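>>>>
>>>> As a rough sketch of the sizing logic inside FileInputFormat's
>>>> getSplits() (Hadoop 2.x; details vary across versions):
>>>>
>>>>     // Per file, the split size is chosen roughly as:
>>>>     long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
>>>>     // With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) this
>>>>     // is simply the block size, so splits line up with HDFS blocks.
>>>>     // Note that the last split may run up to 10% over (SPLIT_SLOP =
>>>>     // 1.1), so a file just over one block, e.g. 129 MB with 128 MB
>>>>     // blocks, can even come back as a single split.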
>>>>
>>>> Q: If yes, it means the map() will be called only once. Right? In
>>>> this case, if there are two datanodes with a replication factor of 1,
>>>> only one datanode (mapper machine) will perform the task. Right?
>>>>
>>>> A: A mapper is called for each split. Don't confuse MR's split with
>>>> HDFS's block. They are different (they may overlap, though, as in the
>>>> case of FileInputFormat). HDFS blocks are a physical partitioning of
>>>> your data, while an InputSplit is just a logical partitioning. If you
>>>> have a file which is smaller than the HDFS block size, then only one
>>>> split will be created, hence only one mapper will be called. And this
>>>> will happen on the node where this file resides.
>>>>
>>>> Q: The map() function is called by all the datanodes/slaves, right?
>>>> If the no. of mappers is more than the no. of slaves, what happens?
>>>>
>>>> A: map() doesn't get called by every node. Rather, a map task gets
>>>> created on the node where the chunk of data to be processed resides.
>>>> A slave node can run multiple mappers, based on the availability of
>>>> its map slots.
>>>>
>>>> Q: One more thing to ask: no. of blocks = no. of mappers. Thus, the
>>>> map() function will be called that many times, right?
>>>>
>>>> A: No. of blocks = no. of splits = no. of map tasks. Each split is
>>>> processed by exactly one map task, and within that task map() is
>>>> called once for every record in the split.
>>>>
>>>> HTH
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>> sugandha.n87@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>> only true when you set isSplitable() as false or when your input
>>>>> file size is less than the block size. Also, when it comes to text
>>>>> files, the default TextInputFormat considers each line as one input
>>>>> split, which can then be read by the RecordReader in K,V form.
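>>>>>
>>>>> For reference, a minimal sketch of the non-splittable case (the
>>>>> class name is made up):
>>>>>
>>>>>     import org.apache.hadoop.fs.Path;
>>>>>     import org.apache.hadoop.mapreduce.JobContext;
>>>>>     import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>>>>
>>>>>     // Forces one split (hence one map task) per file, however many
>>>>>     // HDFS blocks the file occupies.
>>>>>     public class WholeFileTextInputFormat extends TextInputFormat {
>>>>>         @Override
>>>>>         protected boolean isSplitable(JobContext context, Path file) {
>>>>>             return false;
>>>>>         }
>>>>>     }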
>>>>>
>>>>> Please correct me if I don't make sense.
>>>>>
>>>>> --
>>>>> Thanks & Regards,
>>>>> Sugandha Naolekar
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com> wrote:
>>>>>
>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>
>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>
>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>> really make sense to talk about the number of mappers.
>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>> i.e. for a given slot, at a given time, there will be at most one
>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>
>>>>>> Always :
>>>>>> Number of input splits = no. of map tasks
>>>>>>
>>>>>> And generally :
>>>>>> number of hdfs blocks = number of input splits
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Bertrand
>>>>>>
>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>> writing a mail.
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com> wrote:
>>>>>>
>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>> slot hosts a mapper. A mapper executes map tasks. If there are
>>>>>>> more map tasks than slots, there will obviously be multiple
>>>>>>> rounds of mapping.
>>>>>>>
>>>>>>> The map function is called once for each input record. A block is
>>>>>>> typically 64 MB and can contain a multitude of records; therefore
>>>>>>> a map task = running the map() function on all records in the
>>>>>>> block.
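>>>>>>>
>>>>>>> A small sketch to make that concrete (the names are made up):
>>>>>>>
>>>>>>>     import java.io.IOException;
>>>>>>>     import org.apache.hadoop.io.LongWritable;
>>>>>>>     import org.apache.hadoop.io.Text;
>>>>>>>     import org.apache.hadoop.mapreduce.Mapper;
>>>>>>>
>>>>>>>     public class CallCountMapper
>>>>>>>             extends Mapper<LongWritable, Text, Text, LongWritable> {
>>>>>>>         private long calls = 0;
>>>>>>>
>>>>>>>         @Override
>>>>>>>         protected void map(LongWritable offset, Text line, Context ctx) {
>>>>>>>             calls++;  // runs once per record, i.e. per line with TextInputFormat
>>>>>>>         }
>>>>>>>
>>>>>>>         @Override
>>>>>>>         protected void cleanup(Context ctx)
>>>>>>>                 throws IOException, InterruptedException {
>>>>>>>             // one map task per split; this shows how often map() ran in it
>>>>>>>             ctx.write(new Text("map() calls in this task"),
>>>>>>>                       new LongWritable(calls));
>>>>>>>         }
>>>>>>>     }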
>>>>>>>
>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>
>>>>>>> Furthermore, you have to make a distinction between the two
>>>>>>> layers. You have a layer for computation, which consists of a
>>>>>>> jobtracker and a set of tasktrackers. The other layer is
>>>>>>> responsible for storage: HDFS has a namenode and a set of
>>>>>>> datanodes.
>>>>>>>
>>>>>>> In MapReduce the code is executed where the data is. So if a
>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated
>>>>>>> with this block will likely be executed on one of those physical
>>>>>>> nodes, by tasktracker 1, 2 or 3. But this is not guaranteed;
>>>>>>> things can be rearranged.
>>>>>>>
>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>
>>>>>>> Regards, Dieter
>>>>>>>
>>>>>>>
>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>>>>>
>>>>>>>> One more thing to ask: no. of blocks = no. of mappers. Thus, the
>>>>>>>> map() function will be called that many times, right?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>> file(s) are split into chunks/blocks. On the same note, I would
>>>>>>>>> like to ask a few things:
>>>>>>>>>
>>>>>>>>>    1. The no. of mappers is decided as: Total_File_Size / Max.
>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size,
>>>>>>>>>    only one mapper will be invoked. Right?
>>>>>>>>>    2. If yes, it means the map() will be called only once.
>>>>>>>>>    Right? In this case, if there are two datanodes with a
>>>>>>>>>    replication factor of 1, only one datanode (mapper machine)
>>>>>>>>>    will perform the task. Right?
>>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>    right? If the no. of mappers is more than the no. of slaves,
>>>>>>>>>    what happens?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
