hadoop-mapreduce-user mailing list archives

From Rudra Tripathy <rudra...@gmail.com>
Subject Re: Sorting the inputSplits
Date Wed, 19 Aug 2015 04:20:33 GMT
Hi Nishanth,
Even if you order the input splits, you can't order the output.
On Aug 19, 2015 1:55 AM, "Nishanth S" <chinchu2884@gmail.com> wrote:

> Thank you. I have explained the problem in more detail below. Is this
> possible?
>
>
> We have a use case where files are laid out in the directory structure
> below. The requirement is that files inside the same parent directory must
> not be processed in parallel: 1.txt and 2.txt cannot be processed in
> parallel because we need to do some checkpointing, so the oldest file has to
> be processed first. However, 1.txt and 5.txt can be processed in parallel.
> Right now I am overriding the listStatus method to pick only the oldest
> file, but this means I cannot achieve parallelism across parent directories
> either, since the number of input splits is always 1. What would be the way
> to go about this use case? In short, I want parallelism across parent
> directories but not within one. Please advise.
>
>
>
> published/
> +-- Parent1/
> |   +-- 1.txt
> |   +-- 2.txt
> |   +-- 3.txt
> +-- Parent2/
>     +-- 4.txt
>     +-- 5.txt
>
>
>
>
> On Wed, Jul 29, 2015 at 5:31 PM, Gera Shegalov <gera@shegalov.com> wrote:
>
>> Can you clarify the requirement "processed first"? Maps run in parallel
>> without any ordering guarantees. If you want to affect the
>> file -> split number mapping, you can implement your own getSplits in the
>> custom input format and return the splits in any order you like.
>>
>> On Wed, Jul 22, 2015 at 12:06 PM, Nishanth S <chinchu2884@gmail.com>
>> wrote:
>>
>>> Hey folks,
>>>
>>> Is there a way to sort the input splits in MapReduce? We have a case
>>> where there are two files, file1 and file2, in the input directory. Since
>>> our custom InputFormat has isSplitable always return false, each of these
>>> files would be processed by a different mapper. How can I make sure that
>>> file1 is processed before file2 (I want the oldest file to be processed
>>> first)? Is this possible?
>>>
>>> Thanks,
>>> Nishan
>>>
>>
>>
>
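
One way to read the requirement in this thread (parallelism across parent
directories, oldest-first order within each one) is to group the input files by
parent directory and sort each group by modification time. The following is
only a minimal sketch of that grouping step in plain Java, with no Hadoop
dependencies. The class and method names (ParentGroupingSketch, groupByParent)
are hypothetical, and the input is assumed to be a map from file path to
modification time in milliseconds, which a real job would take from the
FileStatus entries returned by listStatus.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: illustrates the grouping idea only.
public class ParentGroupingSketch {

    /**
     * Groups file paths by their parent directory and orders each group
     * oldest first (ties broken by path so the result is deterministic).
     *
     * @param modTimeByPath file path -> modification time in milliseconds
     * @return parent directory -> files in that directory, oldest first
     */
    public static Map<String, List<String>> groupByParent(Map<String, Long> modTimeByPath) {
        // TreeMap keeps parent directories in a stable, sorted order.
        Map<String, List<String>> groups = new TreeMap<>();
        for (String path : modTimeByPath.keySet()) {
            int slash = path.lastIndexOf('/');
            // A path without '/' is treated as living in the "" (root) group.
            String parent = slash < 0 ? "" : path.substring(0, slash);
            groups.computeIfAbsent(parent, k -> new ArrayList<>()).add(path);
        }
        // Within each parent, the oldest file comes first.
        Comparator<String> oldestFirst =
                Comparator.comparingLong((String p) -> modTimeByPath.get(p))
                        .thenComparing(Comparator.<String>naturalOrder());
        for (List<String> files : groups.values()) {
            files.sort(oldestFirst);
        }
        return groups;
    }
}
```

In a real job, each group could then become a single input split returned from
a custom getSplits (for example a CombineFileSplit holding that directory's
files), so that one mapper reads Parent1's files sequentially while another
handles Parent2. That wiring, including a record reader that honours the stored
order, is an assumption about how the idea could be applied and is not shown
here.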
