hadoop-hdfs-user mailing list archives

From Mohammad Tariq <donta...@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Wed, 26 Feb 2014 11:25:09 GMT
In that case you have to convert your JSON data into seq files first and
then do the processing.
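
As a rough sketch of that preprocessing step (plain Python rather than a Hadoop job, and assuming the input looks like the sample quoted below: top-level objects separated by commas and whitespace), each record could be pulled out and rewritten on its own line before it is loaded into sequence files or read line by line. The function names are made up for illustration, and the sample is a shortened copy of the quoted data.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object found in `text`.

    Assumes objects are separated only by whitespace and/or commas.
    """
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while pos < end:
        # raw_decode does not skip leading separators, so skip them here.
        if text[pos] in " \t\r\n,":
            pos += 1
            continue
        # Returns the decoded object and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def to_one_record_per_line(text):
    """Return the same records as compact JSON, one object per line."""
    return "\n".join(
        json.dumps(obj, separators=(",", ":")) for obj in iter_json_objects(text)
    )


# Shortened version of the two records quoted in this thread.
sample = """
{ "type": "Feature", "properties": { "CLAZZ": 42.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [
8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
,
{ "type": "Feature", "properties": { "CLAZZ": 32.000000, "ROW_FLAG": "F" },
"geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
"""

lines = to_one_record_per_line(sample).split("\n")
for line in lines:
    print(line)
```

With one complete object per line, a line-oriented reader hands each record to map() intact, and the same loop could feed whatever writes the sequence file.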

Warm Regards,
Tariq
cloudfront.blogspot.com


On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar
<sugandha.n87@gmail.com> wrote:

> Can I use SequenceFileInputFormat to do the same?
>
>  --
> Thanks & Regards,
> Sugandha Naolekar
>
>
>
>
>
> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> Since there is no OOTB feature that allows this, you have to write your
>> own custom InputFormat to handle JSON data. Alternatively, you could make
>> use of Pig or Hive, as they have built-in JSON support.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>> rajeshnagaraju@gmail.com> wrote:
>>
>>> One simple way is to remove the newline characters, so that the default
>>> record reader and the default way the block is read take care of the
>>> input splits. The JSON itself is not affected by removing the newlines.
>>>
>>>
>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>> sugandha.n87@gmail.com> wrote:
>>>
>>>> Ok. Got it. Now I have a single file of 129MB. Thus, it will be split
>>>> into two blocks. Now, since my file is a JSON file, I cannot use
>>>> TextInputFormat, as every input split (logical) would be a single line of
>>>> the JSON file, which I don't want. Thus, in this case, can I write a
>>>> custom input format and a custom record reader so that every input split
>>>> (logical) has only that part of the data which I require?
>>>>
>>>> For example:
>>>>
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID": 33451.000000,
>>>> "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595, "OSM_SOURCE":
>>>> 1520846283.000000, "COST": 0.007058, "OSM_TARGET": 1520846293.000000, "X2":
>>>> 77.554549, "Y2": 12.993056, "CONGESTED_": 227.541279, "Y1": 12.993107,
>>>> "REVERSE_CO": 0.007058, "CONGESTION": 10.000000, "OSM_ID":
>>>> 138697535.000000, "START_ID": 33450.000000, "KM": 0.000000, "LENGTH":
>>>> 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K": 30.000000, "ROW_FLAG":
>>>> "F" }, "geometry": { "type": "LineString", "coordinates": [ [
>>>> 8633115.407361, 1458944.819456 ], [ 8633332.869986, 1458938.970140 ] ] } }
>>>> ,
>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS": 3.000000,
>>>> "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID": 37016.000000,
>>>> "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462, "OSM_SOURCE":
>>>> 1037135286.000000, "COST": 0.003052, "OSM_TARGET": 1551615728.000000, "X2":
>>>> 77.537950, "Y2": 12.992099, "CONGESTED_": 176.806535, "Y1": 12.993377,
>>>> "REVERSE_CO": 0.003052, "CONGESTION": 20.000000, "OSM_ID": 89417379.000000,
>>>> "START_ID": 24882.000000, "KM": 0.000000, "LENGTH": 156.806535,
>>>> "REVERSE__1": 176.806535, "SPEED_IN_K": 50.000000, "ROW_FLAG": "F" },
>>>> "geometry": { "type": "LineString", "coordinates": [ [ 8631542.162393,
>>>> 1458975.665482 ], [ 8631485.144550, 1458829.592709 ] ] } }
>>>>
>>>> *Here I want every input split to consist of one entire "type" object,
>>>> so that I can process it accordingly by giving the relevant K,V pairs to
>>>> the map function.*
>>>>
>>>>
>>>> --
>>>> Thanks & Regards,
>>>> Sugandha Naolekar
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>
>>>>> Hi Sugandha,
>>>>>
>>>>> Please find my comments embedded below :
>>>>>
>>>>>> No. of mappers are decided as: Total_File_Size/Max. Block Size. Thus,
>>>>>> if the file is smaller than the block size, only one mapper will be
>>>>>> invoked. Right?
>>>>>
>>>>> This is true (but not always). The basic criterion behind map creation
>>>>> is the logic inside the *getSplits* method of the *InputFormat* being
>>>>> used in your MR job. It is the behavior of *file-based InputFormats*,
>>>>> typically subclasses of *FileInputFormat*, to split the input data into
>>>>> splits based on the total size, in bytes, of the input files. See
>>>>> *this* <http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>> for more details. And yes, if the file is smaller than the block size
>>>>> then only 1 mapper will be created.
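
To make the size-based splitting concrete, here is a small Python model of the arithmetic, written from my reading of FileInputFormat.getSplits and not taken from the Hadoop API itself. It assumes a splittable file, models min_size and max_size as the configured minimum and maximum split sizes, and ignores empty files and data locality. The function names are hypothetical.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size is the block size, clamped to [min_size, max_size]."""
    return max(min_size, min(max_size, block_size))


def split_lengths(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Return the byte length of each input split for one splittable file."""
    split_size = compute_split_size(block_size, min_size, max_size)
    lengths = []
    remaining = file_size
    # Cut full-size splits while clearly more than one split's worth remains.
    while remaining / split_size > SPLIT_SLOP:
        lengths.append(split_size)
        remaining -= split_size
    # Whatever is left (possibly slightly more than split_size) is the last split.
    if remaining != 0:
        lengths.append(remaining)
    return lengths


def block_count(file_size, block_size):
    """Number of HDFS blocks the file occupies (ceiling division)."""
    return -(-file_size // block_size)


# A 129 MB file with 128 MB blocks: two blocks, but a single split.
print(block_count(129 * MB, 128 * MB), len(split_lengths(129 * MB, 128 * MB)))
# The same file with 64 MB blocks: three blocks, two splits.
print(block_count(129 * MB, 64 * MB), len(split_lengths(129 * MB, 64 * MB)))
```

Under these assumptions the number of splits follows the number of blocks closely but is not always equal to it, which is why "but not always" applies.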
>>>>>
>>>>>> If yes, it means, the map() will be called only once. Right? In this
>>>>>> case, if there are two datanodes with a replication factor as 1: only
>>>>>> one datanode (mapper machine) will perform the task. Right?
>>>>>
>>>>> A mapper is called for each split. Don't confuse MR's split with HDFS's
>>>>> block. The two are different (they may overlap though, as in the case
>>>>> of FileInputFormat). HDFS blocks are a physical partitioning of your
>>>>> data, while an InputSplit is just a logical partitioning. If you have a
>>>>> file which is smaller than the HDFS block size then only one split will
>>>>> be created, hence only 1 mapper will be called. And this will happen on
>>>>> the node where this file resides.
>>>>>
>>>>>> The map() function is called by all the datanodes/slaves right? If
>>>>>> the no. of mappers are more than the no. of slaves, what happens?
>>>>>
>>>>> map() doesn't get called by the datanodes. Rather, a mapper gets
>>>>> created on the node where the chunk of data to be processed resides. A
>>>>> slave node can run multiple mappers, depending on the availability of
>>>>> slots.
>>>>>
>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>> map() function will be called that many times, right?
>>>>>
>>>>> No. of blocks = no. of splits = no. of mappers. A mapper is created
>>>>> only once per split, on a node where that split is present.
>>>>>
>>>>> HTH
>>>>>
>>>>> Warm Regards,
>>>>> Tariq
>>>>> cloudfront.blogspot.com
>>>>>
>>>>>
>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Hi Bertrand,
>>>>>>
>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>> only true when you set isSplitable() as false or when your input file
>>>>>> size is less than the block size. Also, when it comes to text files,
>>>>>> the default TextInputFormat considers each line as one input split,
>>>>>> which can then be read by the RecordReader in K,V format.
>>>>>>
>>>>>> Please correct me if I don't make sense.
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <dechouxb@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>
>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>
>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>> really make sense to talk about the number of mappers.
>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>> i.e. for a given slot, at a given time, there will be at most a
>>>>>>> single task. During the task, the configured Mapper will be
>>>>>>> instantiated.
>>>>>>>
>>>>>>> Always :
>>>>>>> Number of input splits = no. of map tasks
>>>>>>>
>>>>>>> And generally :
>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Bertrand
>>>>>>>
>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>> writing a mail.
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <drdwitte@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Each node has a tasktracker with a number of map slots. A map slot
>>>>>>>> hosts a mapper. A mapper executes map tasks. If there are more map
>>>>>>>> tasks than slots, obviously there will be multiple rounds of mapping.
>>>>>>>>
>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>> typically 64MB and can contain a multitude of records; therefore a
>>>>>>>> map task = run the map() function on all records in the block.
>>>>>>>>
>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>
>>>>>>>> Furthermore you have to make a distinction between the two layers.
>>>>>>>> You have a layer for computations which consists of a jobtracker
>>>>>>>> and a set of tasktrackers. The other layer is responsible for
>>>>>>>> storage. The HDFS has a namenode and a set of datanodes.
>>>>>>>>
>>>>>>>> In MapReduce the code is executed where the data is. So if a block
>>>>>>>> is in datanode 1, 2 and 3, then the map task associated with this
>>>>>>>> block will likely be executed on one of those physical nodes, by
>>>>>>>> tasktracker 1, 2 or 3. But this is not necessary; things can be
>>>>>>>> rearranged.
>>>>>>>>
>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>
>>>>>>>> Regards, Dieter
>>>>>>>>
>>>>>>>>
>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <sugandha.n87@gmail.com>:
>>>>>>>>
>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Thanks & Regards,
>>>>>>>>> Sugandha Naolekar
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> As per the various articles I have gone through to date, the
>>>>>>>>>> file(s) are split into chunks/blocks. On the same note, I would
>>>>>>>>>> like to ask a few things:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size/Max. Block
>>>>>>>>>>    Size. Thus, if the file is smaller than the block size, only
>>>>>>>>>>    one mapper will be invoked. Right?
>>>>>>>>>>    2. If yes, it means, the map() will be called only once.
>>>>>>>>>>    Right? In this case, if there are two datanodes with a
>>>>>>>>>>    replication factor as 1: only one datanode (mapper machine)
>>>>>>>>>>    will perform the task. Right?
>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves
>>>>>>>>>>    right? If the no. of mappers are more than the no. of slaves,
>>>>>>>>>>    what happens?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Thanks & Regards,
>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
