hadoop-mapreduce-user mailing list archives

From Thamizhannal Paramasivam <thamizhanna...@gmail.com>
Subject Re: num of reducer
Date Fri, 17 Feb 2012 17:37:30 GMT
It worked for me. Thanks a lot, Bejoy.

Thanks
Thamizh

On Fri, Feb 17, 2012 at 3:08 PM, Bejoy Ks <bejoy.hadoop@gmail.com> wrote:

> Hi Tamizh
>          MultiFileInputFormat / CombineFileInputFormat is typically used
> where the input files are relatively small (typically less than a block
> size). When you use these, there is some loss of data locality, as the
> splits a mapper processes won't all be on the same node.
>        By default, TextInputFormat spawns one mapper per block (not one
> per file). Here you retain data locality much better than with
> MultiFileInputFormat.
>       If your mappers are not very short-lived and involve a decent amount
> of processing, then you can go with TextInputFormat. The one consideration
> you need to make is that, on your specified input, this job may spawn a
> large number of map tasks, thereby occupying almost all the map task slots
> in your cluster. If other tasks need to be triggered, they may have to
> wait for free map slots. You may need to consider using a scheduler that
> gives a fair share of slots to other parallel jobs as well, if any.
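>
> A minimal sketch of that TextInputFormat setup on the old mapred API of
> 0.19.x; the class, job, and path names here are placeholders, not a
> definitive implementation:
>
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapred.FileInputFormat;
>   import org.apache.hadoop.mapred.FileOutputFormat;
>   import org.apache.hadoop.mapred.JobClient;
>   import org.apache.hadoop.mapred.JobConf;
>   import org.apache.hadoop.mapred.TextInputFormat;
>
>   public class TextInputDriver {
>     public static void main(String[] args) throws Exception {
>       JobConf conf = new JobConf(TextInputDriver.class);
>       conf.setJobName("text-input-job");          // placeholder name
>       // One map task per HDFS block of each input file, by default.
>       conf.setInputFormat(TextInputFormat.class);
>       FileInputFormat.setInputPaths(conf, new Path(args[0]));
>       FileOutputFormat.setOutputPath(conf, new Path(args[1]));
>       JobClient.runJob(conf);
>     }
>   }
>
> On the scheduler point, if I remember the 0.19 contrib correctly, the
> fair scheduler is enabled by setting mapred.jobtracker.taskScheduler to
> org.apache.hadoop.mapred.FairScheduler on the JobTracker.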
>
> Regards
> Bejoy.K.S
>
>
>
> On Fri, Feb 17, 2012 at 10:26 AM, Thamizhannal Paramasivam <
> thamizhannal.p@gmail.com> wrote:
>
>> Thank you so much to Joey & Bejoy for your suggestions.
>>
>> The job's input path has 1300-1400 text files, each 100-200 MB.
>>
>> I thought TextInputFormat spawned a single mapper per file, and
>> MultiFileInputFormat spawned fewer mappers (fewer than the 1300-1400
>> files), each processing many input files.
>>
>> Which input format do you think would be most appropriate in my case,
>> and why?
>>
>> Looking forward to your reply.
>>
>> Thanks,
>> Thamizh
>>
>>
>>
>> On Thu, Feb 16, 2012 at 10:06 PM, Joey Echeverria <joey@cloudera.com> wrote:
>>
>>> Is your data size 100-200MB *total*?
>>>
>>> If so, then this is the expected behavior for MultiFileInputFormat. As
>>> Bejoy says, you can switch to TextInputFormat to get one mapper per block
>>> (at a minimum, one mapper per file).
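>>>
>>> Also, if I'm reading the old API right, MultiFileInputFormat.getSplits()
>>> packs all the input files into roughly the number of splits it is asked
>>> for, and that hint defaults to mapred.map.tasks (2 out of the box), which
>>> would explain the 2 mappers you saw. A rough sketch of raising the hint
>>> if you stay with it (class names are placeholders):
>>>
>>>   JobConf conf = new JobConf(MyDriver.class);        // placeholder driver
>>>   conf.setInputFormat(MyMultiFileInputFormat.class); // your concrete subclass
>>>   // getSplits(conf, numSplits) groups the files into ~numSplits splits,
>>>   // so raise the hint to get more mappers:
>>>   conf.setNumMapTasks(500);                          // example value only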
>>>
>>> -Joey
>>>
>>>
>>> On Thu, Feb 16, 2012 at 11:03 AM, Thamizhannal Paramasivam <
>>> thamizhannal.p@gmail.com> wrote:
>>>
>>>> Here are the mapper's input format and output types:
>>>> Input Format: MultiFileInputFormat
>>>> MapperOutputKey : Text
>>>> MapperOutputValue: CustomWritable
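>>>>
>>>> For reference, a custom value type like that only needs the Writable
>>>> contract; a minimal sketch with a made-up field (the real fields differ):
>>>>
>>>>   import java.io.DataInput;
>>>>   import java.io.DataOutput;
>>>>   import java.io.IOException;
>>>>   import org.apache.hadoop.io.Writable;
>>>>
>>>>   public class CustomWritable implements Writable {
>>>>     private long count;                       // made-up example field
>>>>     public void write(DataOutput out) throws IOException {
>>>>       out.writeLong(count);
>>>>     }
>>>>     public void readFields(DataInput in) throws IOException {
>>>>       count = in.readLong();
>>>>     }
>>>>   }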
>>>>
>>>> I am not in a position to upgrade from hadoop-0.19.2, for various reasons.
>>>>
>>>> I checked the number of mappers on the JobTracker.
>>>>
>>>> Thanks,
>>>> Thamizh
>>>>
>>>>
>>>> On Thu, Feb 16, 2012 at 6:56 PM, Joey Echeverria <joey@cloudera.com> wrote:
>>>>
>>>>> Hi Tamil,
>>>>>
>>>>> I'd recommend upgrading to a newer release, as 0.19.2 is very old. As
>>>>> for your question, most input formats should set the number of mappers
>>>>> correctly. What input format are you using? Where did you see the number
>>>>> of tasks it assigned to the job?
>>>>>
>>>>> -Joey
>>>>>
>>>>>
>>>>> On Thu, Feb 16, 2012 at 1:40 AM, Thamizhannal Paramasivam <
>>>>> thamizhannal.p@gmail.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>> I am using hadoop-0.19.2 and running a mapper-only job on the cluster.
>>>>>> Its input path has >1000 files of 100-200MB each. Since it is a
>>>>>> mapper-only job, I set the number of reducers to 0. It is using only 2
>>>>>> mappers to run over all the input files. If I do not set the number of
>>>>>> mappers, wouldn't it pick one mapper per input file? Or wouldn't the
>>>>>> default pick a fair number of mappers according to the number of input
>>>>>> files?
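>>>>>>
>>>>>> For context, the driver sets things up roughly like this (class and
>>>>>> path names are placeholders, not the actual code):
>>>>>>
>>>>>>   JobConf conf = new JobConf(MyDriver.class);
>>>>>>   conf.setMapperClass(MyMapper.class);  // placeholder mapper
>>>>>>   conf.setNumReduceTasks(0);            // mapper-only: map output goes straight to HDFS
>>>>>>   FileInputFormat.setInputPaths(conf, new Path("/input"));    // placeholder
>>>>>>   FileOutputFormat.setOutputPath(conf, new Path("/output"));  // placeholder
>>>>>>   JobClient.runJob(conf);
>>>>>>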
>>>>>> Thanks,
>>>>>> tamil
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Joseph Echeverria
>>>>> Cloudera, Inc.
>>>>> 443.305.9434
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Joseph Echeverria
>>> Cloudera, Inc.
>>> 443.305.9434
>>>
>>>
>>
>
