hadoop-mapreduce-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Sorting the inputSplits
Date Thu, 30 Jul 2015 18:50:15 GMT
If you meant 'scheduled' first, perhaps that's doable by following (almost)
what Gera says. The framework explicitly sorts your InputSplits list by the
lengths the splits report, which would serve as the hack point for
inducing a reordering. See
https://github.com/apache/hadoop-common/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/JobSubmitter.java#L498-L503
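
To illustrate the idea, here is a minimal standalone sketch that does not use the Hadoop API. FileSplitStub, reportedLength() and submissionOrder() are hypothetical names made up for this example; the stub stands in for a custom InputSplit whose getLength() is overridden. It assumes the submitter sorts splits so that the largest reported length comes first, and it reports a larger length for an older file so that the oldest file sorts to the front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SplitOrderSketch {

    // Hypothetical stand-in for a custom InputSplit covering one unsplittable file.
    public static final class FileSplitStub {
        final String path;
        final long modificationTime; // epoch millis; smaller means older

        public FileSplitStub(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }

        // What the custom split would return from getLength():
        // not the real size, but a value that grows as the file gets older.
        public long reportedLength() {
            return Long.MAX_VALUE - modificationTime;
        }
    }

    // Assumption: mimics the submitter-side sort, largest reported length first.
    public static List<String> submissionOrder(List<FileSplitStub> splits) {
        FileSplitStub[] array = splits.toArray(new FileSplitStub[0]);
        Arrays.sort(array, new Comparator<FileSplitStub>() {
            @Override
            public int compare(FileSplitStub a, FileSplitStub b) {
                return Long.compare(b.reportedLength(), a.reportedLength());
            }
        });
        List<String> order = new ArrayList<String>();
        for (FileSplitStub split : array) {
            order.add(split.path);
        }
        return order;
    }

    public static void main(String[] args) {
        List<FileSplitStub> splits = new ArrayList<FileSplitStub>();
        splits.add(new FileSplitStub("file2", 2000L)); // newer file, listed first
        splits.add(new FileSplitStub("file1", 1000L)); // older file
        System.out.println(submissionOrder(splits));
    }
}
```

Here file1 ends up ahead of file2 even though it was listed second. Note that this only influences the order in which the splits are handed out for scheduling; it does not make file1 finish before file2 starts, and anything else that reads the reported length will see the fake value too.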

On Thu, Jul 30, 2015 at 10:34 PM Niels Basjes <Niels@basjes.nl> wrote:

> MapReduce is based on the premise that several parts of a task can be
> processed independently in parallel.
> If you "require" an order of processing, then these files depend on
> each other. Why use MapReduce at all?
> With your requirement you cannot use more than one CPU anyway.
>
> Niels
>
> On Thu, 30 Jul 2015 01:31 Gera Shegalov <gera@shegalov.com> wrote:
>
>> Can you clarify the requirement "processed first"? Maps run in parallel
>> without any ordering guarantees. If you want to affect the mapping
>> file->split number, you can implement your own getSplits in the custom
>> input format and return the splits ordered any way you like.
>>
>> On Wed, Jul 22, 2015 at 12:06 PM, Nishanth S <chinchu2884@gmail.com>
>> wrote:
>>
>>> Hey folks,
>>>
>>> Is there a way to sort the input splits in MapReduce? We have a case
>>> where there are two files, file1 and file2, in the input directory. Since we
>>> have a custom InputFormat whose isSplitable always returns false, each
>>> of these files would be processed by a different mapper. How could I make
>>> sure that file1 is processed before file2 (I want the oldest file to be
>>> processed first)? Is this possible?
>>>
>>> Thanks,
>>> Nishan
>>>
>>
>>
