hadoop-mapreduce-user mailing list archives

From Sugandha Naolekar <sugandha....@gmail.com>
Subject Re: Mappers vs. Map tasks
Date Thu, 27 Feb 2014 04:21:13 GMT
Joao Paulo,

Your suggestion is appreciated. Although, on a side note, which is more
tedious: writing a custom InputFormat, or changing the code which is
generating the input?

--
Thanks & Regards,
Sugandha Naolekar





On Wed, Feb 26, 2014 at 8:03 PM, João Paulo Forny <jpforny@gmail.com> wrote:

> If I understood your problem correctly, you have one huge JSON, which is
> basically a JSONArray, and you want to process one JSONObject of the array
> at a time.
>
> I have faced the same issue some time ago and instead of changing the
> input format, I changed the code that was generating this input, to
> generate lots of JSONObjects, one per line. Hence, using the default
> TextInputFormat, the map function was getting called with the entire JSON.
>
> A JSONArray is not good as a mapreduce input, since it has a leading [, a
> trailing ], and commas between the JSONs of the array. The array can instead
> be represented implicitly as the file to which the JSONs belong.
>
> Of course, this approach works only if you can modify what is generating
> the input you're talking about.
>
>
> 2014-02-26 8:25 GMT-03:00 Mohammad Tariq <dontariq@gmail.com>:
>
>> In that case you have to convert your JSON data into SequenceFiles first
>> and then do the processing.
>>
>> Warm Regards,
>> Tariq
>> cloudfront.blogspot.com
>>
>>
>> On Wed, Feb 26, 2014 at 4:43 PM, Sugandha Naolekar <
>> sugandha.n87@gmail.com> wrote:
>>
>>> Can I use SequenceFileInputFormat to do the same?
>>>
>>>  --
>>> Thanks & Regards,
>>> Sugandha Naolekar
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Feb 26, 2014 at 4:38 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>
>>>> Since there is no OOTB feature that allows this, you have to write a
>>>> custom InputFormat to handle JSON data. Alternatively, you could make use
>>>> of Pig or Hive, as they have built-in JSON support.
>>>>
>>>> Warm Regards,
>>>> Tariq
>>>> cloudfront.blogspot.com
>>>>
>>>>
>>>> On Wed, Feb 26, 2014 at 10:07 AM, Rajesh Nagaraju <
>>>> rajeshnagaraju@gmail.com> wrote:
>>>>
>>>>> 1 simple way is to remove the newline characters, so that the default
>>>>> record reader and the default way the block is read will take care of
>>>>> the input splits, and the JSON will not get affected by the removal of
>>>>> the NL characters.
>>>>>
>>>>> On Wed, Feb 26, 2014 at 10:01 AM, Sugandha Naolekar <
>>>>> sugandha.n87@gmail.com> wrote:
>>>>>
>>>>>> Ok, got it. Now I have a single file of 129 MB, so it will be split
>>>>>> into two blocks. Since my file is a JSON file, I cannot use
>>>>>> TextInputFormat, as every record would be a single line of the JSON
>>>>>> file, which I don't want. In this case, can I write a custom
>>>>>> InputFormat and a custom RecordReader so that every (logical) input
>>>>>> split has only the part of the data that I require?
>>>>>>
>>>>>> For e.g.:
>>>>>>
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 42.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 33451.000000, "OSM_META": "", "REVERSE_LE": 217.541279, "X1": 77.552595,
>>>>>> "OSM_SOURCE": 1520846283.000000, "COST": 0.007058, "OSM_TARGET":
>>>>>> 1520846293.000000, "X2": 77.554549, "Y2": 12.993056, "CONGESTED_":
>>>>>> 227.541279, "Y1": 12.993107, "REVERSE_CO": 0.007058, "CONGESTION":
>>>>>> 10.000000, "OSM_ID": 138697535.000000, "START_ID": 33450.000000, "KM":
>>>>>> 0.000000, "LENGTH": 217.541279, "REVERSE__1": 227.541279, "SPEED_IN_K":
>>>>>> 30.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8633115.407361, 1458944.819456 ], [ 8633332.869986,
>>>>>> 1458938.970140 ] ] } }
>>>>>> ,
>>>>>> { "type": "Feature", "properties": { "OSM_NAME": "", "FLAGS":
>>>>>> 3.000000, "CLAZZ": 32.000000, "ROAD_TYPE": 3.000000, "END_ID":
>>>>>> 37016.000000, "OSM_META": "", "REVERSE_LE": 156.806535, "X1": 77.538462,
>>>>>> "OSM_SOURCE": 1037135286.000000, "COST": 0.003052, "OSM_TARGET":
>>>>>> 1551615728.000000, "X2": 77.537950, "Y2": 12.992099, "CONGESTED_":
>>>>>> 176.806535, "Y1": 12.993377, "REVERSE_CO": 0.003052, "CONGESTION":
>>>>>> 20.000000, "OSM_ID": 89417379.000000, "START_ID": 24882.000000, "KM":
>>>>>> 0.000000, "LENGTH": 156.806535, "REVERSE__1": 176.806535, "SPEED_IN_K":
>>>>>> 50.000000, "ROW_FLAG": "F" }, "geometry": { "type": "LineString",
>>>>>> "coordinates": [ [ 8631542.162393, 1458975.665482 ], [ 8631485.144550,
>>>>>> 1458829.592709 ] ] } }
>>>>>>
>>>>>> *I want every input split here to consist of one entire "Feature"
>>>>>> object, so that I can process it accordingly by passing relevant K,V
>>>>>> pairs to the map function.*
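A custom RecordReader along these lines has to find object boundaries itself. Below is a sketch of that boundary logic only; the real thing would be a Java subclass of RecordReader, and the function name here is hypothetical:

```python
def iter_json_objects(text):
    """Yield each top-level {...} object from concatenated JSON objects,
    skipping the commas/whitespace between them -- roughly what a custom
    RecordReader would emit as one value per map() call."""
    depth, start, in_str, esc = 0, None, False, False
    for i, ch in enumerate(text):
        if in_str:                 # inside a string literal: ignore braces
            if esc:
                esc = False
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
        elif ch == '"':
            in_str = True
        elif ch == "{":
            if depth == 0:
                start = i          # a new top-level object begins here
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:         # braces balanced: one complete record
                yield text[start:i + 1]

records = list(iter_json_objects('{"a": {"b": 1}},\n{"a": {"b": 2}}'))
```

Note that a real implementation must also handle a record straddling a split boundary, which is the genuinely tricky part of any custom InputFormat.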
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thanks & Regards,
>>>>>> Sugandha Naolekar
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 26, 2014 at 2:09 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sugandha,
>>>>>>>
>>>>>>> Please find my comments embedded below :
>>>>>>>
>>>>>>> *No. of mappers are decided as: Total_File_Size / Max. Block Size.
>>>>>>> Thus, if the file is smaller than the block size, only one mapper will
>>>>>>> be invoked. Right?*
>>>>>>> This is true (but not always). The basic criterion behind map creation
>>>>>>> is the logic inside the *getSplits* method of the *InputFormat* being
>>>>>>> used in your MR job. It is the behavior of *file-based InputFormats*,
>>>>>>> typically sub-classes of *FileInputFormat*, to split the input data
>>>>>>> into splits based on the total size, in bytes, of the input files. See
>>>>>>> *this*<http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapreduce/InputFormat.html>
>>>>>>> for more details. And yes, if the file is smaller than the block size,
>>>>>>> then only 1 mapper will be created.
>>>>>>>
>>>>>>> *If yes, it means the map() will be called only once. Right? In this
>>>>>>> case, if there are two datanodes with a replication factor of 1, only
>>>>>>> one datanode (mapper machine) will perform the task. Right?*
>>>>>>> A mapper is called for each split. Don't confuse MR's split with
>>>>>>> HDFS's block. The two are different (they may overlap, though, as in
>>>>>>> the case of FileInputFormat). HDFS blocks are a physical partitioning
>>>>>>> of your data, while an InputSplit is just a logical partitioning. If
>>>>>>> you have a file which is smaller than the HDFS block size, then only
>>>>>>> one split will be created, and hence only 1 mapper will be called. And
>>>>>>> this will happen on the node where the file resides.
>>>>>>>
>>>>>>> *The map() function is called by all the datanodes/slaves, right? If
>>>>>>> the no. of mappers is more than the no. of slaves, what happens?*
>>>>>>> map() doesn't get called by anybody. Rather, the mapper gets created
>>>>>>> on the node where the chunk of data to be processed resides. A slave
>>>>>>> node can run multiple mappers, based on the availability of CPU slots.
>>>>>>>
>>>>>>> *One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>> map() function will be called that many times, right?*
>>>>>>> No. of blocks = no. of splits = no. of mappers. A map is called only
>>>>>>> once per split, on the node where that split is present.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Warm Regards,
>>>>>>> Tariq
>>>>>>> cloudfront.blogspot.com
>>>>>>>
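The split arithmetic described above can be mimicked outside Hadoop. Here is a simplified sketch of what a file-based InputFormat computes, ignoring the real implementation's slack factor and per-block locality, and assuming the 128 MB default block size of Hadoop 2:

```python
def compute_splits(file_size, block_size, min_split=1, max_split=float("inf")):
    """Return (offset, length) pairs, one per split, the way file-based
    InputFormats carve up a file: splitSize = max(min, min(max, blockSize))."""
    split_size = max(min_split, min(max_split, block_size))
    splits, offset = [], 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

MB = 1024 * 1024
# The 129 MB file from this thread against a 128 MB block size
# yields two splits, hence two map tasks:
splits = compute_splits(129 * MB, 128 * MB)
```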
>>>>>>>
>>>>>>> On Tue, Feb 25, 2014 at 3:54 PM, Sugandha Naolekar <
>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Bertrand,
>>>>>>>>
>>>>>>>> As you said, no. of HDFS blocks = no. of input splits. But this is
>>>>>>>> only true when you set isSplittable() as false, or when your input
>>>>>>>> file size is less than the block size. Also, when it comes to text
>>>>>>>> files, the default TextInputFormat considers each line as one input
>>>>>>>> split, which can then be read by the RecordReader in K,V format.
>>>>>>>>
>>>>>>>> Please correct me if I don't make sense.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks & Regards,
>>>>>>>> Sugandha Naolekar
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 25, 2014 at 2:07 PM, Bertrand Dechoux <
>>>>>>>> dechouxb@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> The wiki (or Hadoop: The Definitive Guide) is a good resource.
>>>>>>>>>
>>>>>>>>> https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-7/input-formats
>>>>>>>>>
>>>>>>>>> Mapper is the name of the abstract class/interface. It does not
>>>>>>>>> really make sense to talk about the number of mappers.
>>>>>>>>> A task is a JVM that can be launched only if there is a free slot,
>>>>>>>>> i.e. for a given slot, at a given time, there will be at most one
>>>>>>>>> task. During the task, the configured Mapper will be instantiated.
>>>>>>>>>
>>>>>>>>> Always :
>>>>>>>>> Number of input splits = no. of map tasks
>>>>>>>>>
>>>>>>>>> And generally :
>>>>>>>>> number of hdfs blocks = number of input splits
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Bertrand
>>>>>>>>>
>>>>>>>>> PS: I don't know if it is only my client, but avoid red when
>>>>>>>>> writing a mail.
>>>>>>>>>
>>>>>>>>> On Tue, Feb 25, 2014 at 8:49 AM, Dieter De Witte <
>>>>>>>>> drdwitte@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Each node has a tasktracker with a number of map slots. A map
>>>>>>>>>> slot hosts a mapper, and a mapper executes map tasks. If there
>>>>>>>>>> are more map tasks than slots, there will obviously be multiple
>>>>>>>>>> rounds of mapping.
>>>>>>>>>>
>>>>>>>>>> The map function is called once for each input record. A block is
>>>>>>>>>> typically 64 MB and can contain a multitude of records; therefore,
>>>>>>>>>> a map task = running the map() function on all records in the
>>>>>>>>>> block.
>>>>>>>>>>
>>>>>>>>>> Number of blocks = no. of map tasks (not mappers)
>>>>>>>>>>
>>>>>>>>>> Furthermore, you have to make a distinction between the two
>>>>>>>>>> layers. You have a layer for computation, which consists of a
>>>>>>>>>> jobtracker and a set of tasktrackers. The other layer is
>>>>>>>>>> responsible for storage: HDFS has a namenode and a set of
>>>>>>>>>> datanodes.
>>>>>>>>>>
>>>>>>>>>> In mapreduce, the code is executed where the data is. So if a
>>>>>>>>>> block is on datanodes 1, 2 and 3, then the map task associated
>>>>>>>>>> with this block will likely be executed on one of those physical
>>>>>>>>>> nodes, by tasktracker 1, 2 or 3. But this is not necessary;
>>>>>>>>>> things can be rearranged.
>>>>>>>>>>
>>>>>>>>>> Hopefully this gives you a little more insight.
>>>>>>>>>>
>>>>>>>>>> Regards, Dieter
>>>>>>>>>>
>>>>>>>>>>
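The distinction above (one map *task* per split, one map() *call* per record) can be sketched as a toy simulation. The function names are hypothetical and no Hadoop is involved:

```python
def run_map_task(split_records, map_fn):
    """Simulate one map task: the framework invokes map(key, value) once for
    every record in the task's split and collects the emitted (K, V) pairs."""
    emitted = []
    for offset, record in enumerate(split_records):
        emitted.extend(map_fn(offset, record))
    return emitted

# A toy word-count mapper over a two-record split: 2 records -> 2 map() calls,
# but still only 1 map task.
pairs = run_map_task(["a b", "b"], lambda k, v: [(w, 1) for w in v.split()])
```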
>>>>>>>>>> 2014-02-25 7:05 GMT+01:00 Sugandha Naolekar <
>>>>>>>>>> sugandha.n87@gmail.com>:
>>>>>>>>>>
>>>>>>>>>>> One more thing to ask: No. of blocks = no. of mappers. Thus, the
>>>>>>>>>>> map() function will be called that many times, right?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Feb 25, 2014 at 11:27 AM, Sugandha Naolekar <
>>>>>>>>>>> sugandha.n87@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> As per the various articles I went through till date, the
>>>>>>>>>>>> file(s) are split into chunks/blocks. On the same note, I would
>>>>>>>>>>>> like to ask a few things:
>>>>>>>>>>>>
>>>>>>>>>>>>    1. No. of mappers are decided as: Total_File_Size / Max.
>>>>>>>>>>>>    Block Size. Thus, if the file is smaller than the block size,
>>>>>>>>>>>>    only one mapper will be invoked. Right?
>>>>>>>>>>>>    2. If yes, it means the map() will be called only once.
>>>>>>>>>>>>    Right? In this case, if there are two datanodes with a
>>>>>>>>>>>>    replication factor of 1, only one datanode (mapper machine)
>>>>>>>>>>>>    will perform the task. Right?
>>>>>>>>>>>>    3. The map() function is called by all the datanodes/slaves,
>>>>>>>>>>>>    right? If the no. of mappers is more than the no. of slaves,
>>>>>>>>>>>>    what happens?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>> Sugandha Naolekar
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
