hadoop-hdfs-user mailing list archives

From "Guttadauro, Jeff" <jeff.guttada...@here.com>
Subject RE: What is the class that launches the reducers?
Date Mon, 29 Aug 2016 14:12:55 GMT
I have not worked with Tez, but Hitesh's idea sounds promising.

If the Tez route doesn't work and you want to stay within the MR framework: AFAIK MR doesn’t
provide a mechanism for this type of workflow.  One approach that might get you what
you’re after is to have the setup method in your Reducer class loop until some marker
file appears (in HDFS or S3), sort of like “_SUCCESS”.  Your verification process would
create that file once its checks were done, perhaps in a location named
after the job id...

There could be some timeouts to contend with, depending on how long the wait
needs to be.  You may need to report status back to the AppMaster periodically
within the loop.  There might even be a timeout setting I’m not aware of that dictates
how long the setup step is allowed to take before moving into the actual reduce step, but
I’m not sure about that.
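
A minimal sketch of the polling loop described above, written as plain Java so it runs without Hadoop on the classpath. The class name MarkerWait, the method awaitMarker, the marker path layout, and the poll and wait intervals are all hypothetical. The heartbeat callback stands in for the periodic status report: in a real Reducer it would be context.progress(), since a task that neither reads input, writes output, nor updates its status for mapreduce.task.timeout milliseconds (600000 by default) is killed.

```java
import java.util.function.BooleanSupplier;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class MarkerWait {
    private MarkerWait() {}

    /**
     * Polls markerExists until it returns true or maxWaitMillis elapses.
     * heartbeat runs before every sleep so the caller can report liveness.
     * Returns true if the marker appeared, false if the wait timed out.
     */
    static boolean awaitMarker(BooleanSupplier markerExists,
                               Runnable heartbeat,
                               long pollMillis,
                               long maxWaitMillis) throws InterruptedException {
        long deadline = System.nanoTime() + maxWaitMillis * 1_000_000L;
        while (true) {
            if (markerExists.getAsBoolean()) {
                return true;                 // verification finished
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;                // gave up waiting
            }
            heartbeat.run();                 // e.g. context.progress()
            Thread.sleep(pollMillis);
        }
    }
}

// Intended call site inside a Reducer (needs Hadoop; the path is an assumption):
//
//   @Override
//   protected void setup(Context context) throws IOException, InterruptedException {
//       Path marker = new Path("/verification/" + context.getJobID() + "/_VERIFIED");
//       FileSystem fs = marker.getFileSystem(context.getConfiguration());
//       boolean ok = MarkerWait.awaitMarker(
//           () -> {
//               try { return fs.exists(marker); }
//               catch (IOException e) { throw new UncheckedIOException(e); }
//           },
//           context::progress, 5_000L, 3_600_000L);
//       if (!ok) throw new IOException("verification marker never appeared: " + marker);
//   }
```

Failing the task on timeout, rather than proceeding, keeps the reducers from consuming unverified map output; whether that is the right policy depends on the verification process.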

Good luck!  Curious to hear how you end up solving this one.

-----Original Message-----
From: Hitesh Shah [mailto:hitesh@apache.org] 
Sent: Friday, August 26, 2016 4:36 PM
To: xeon Mailinglist <xeonmailinglist@gmail.com>
Cc: user@hadoop.apache.org
Subject: Re: What is the class that launches the reducers?

Have you considered using Tez with a 3-vertex DAG instead of trying to change the
MR framework? i.e. A->B, A->C, B->C, where A is the original map, B is (I assume) the
verification stage, and C is the reducer, configured not to start doing any work
until B’s verification completes. The above may or may not fit your requirements. Feel free
to send any questions to the user@tez mailing list if you think about going down
this path. 

thanks
— Hitesh
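
The dependency structure of that 3-vertex DAG can be sketched in plain Java. This is not the Tez API; the class ThreeVertexDag and the method runnable are hypothetical names used only to show why C cannot begin until both A and B have finished.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: plain Java, not the Tez API. */
final class ThreeVertexDag {
    private ThreeVertexDag() {}

    // vertex -> vertices that must finish first (edges A->B, A->C, B->C)
    static final Map<String, Set<String>> PREREQS = new LinkedHashMap<>();
    static {
        PREREQS.put("A", Set.of());          // original map
        PREREQS.put("B", Set.of("A"));       // verification stage
        PREREQS.put("C", Set.of("A", "B"));  // reducer
    }

    /** Returns the vertices that are not yet done and whose prerequisites are all done. */
    static List<String> runnable(Set<String> done) {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : PREREQS.entrySet()) {
            if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                ready.add(e.getKey());
            }
        }
        return ready;
    }
}
```

With nothing finished only A is runnable; once A is done only B is; C becomes runnable only after both A and B are in the done set.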


> On Aug 25, 2016, at 11:07 PM, xeon Mailinglist <xeonmailinglist@gmail.com> wrote:
> 
> Right now the map and reduce tasks produce digests of the output. This 
> logic is inside the map and reduce functions. I need to pause the 
> execution when all maps finish because there will be an external 
> program that is synchronizing several mapreduce runtimes. When all map 
> tasks finish from the several jobs, the map output will be verified. 
> Then, this external program will resume the execution.
> 
> I really want to create a knob in mapreduce by modifying the source 
> code, because with this knob I can exclude the identity maps execution 
> and boost the performance. I think the devs should create this feature.
> 
> Anyway, I am looking in the source code for the part where reduce 
> tasks are set to launch. Does anyone know which class launches the 
> reduce tasks in mapreduce v2?
> 
> On Aug 26, 2016 02:07, "Daniel Templeton" <daniel@cloudera.com> wrote:
> 
>> How are you intending to verify the map output?  It's only partially 
>> dumped to disk.  None of the intermediate data goes into HDFS.
>> 
>> Daniel
>> 
>> On Aug 25, 2016 4:10 PM, "xeon Mailinglist" 
>> <xeonmailinglist@gmail.com>
>> wrote:
>> 
>>> But then I need to set identity maps to run the reducers. If I 
>>> suspend a job after the maps finish, I don't need to set identity 
>>> maps up. I want to suspend a job so that I don't run identity maps 
>>> and get better performance.
>>> 
>>> On Aug 25, 2016 10:12 PM, "Haibo Chen" <haibochen@cloudera.com> wrote:
>>> 
>>> One thing you can try is to write a map-only job first and then 
>>> verify the map output.
>>> 
>>> On Thu, Aug 25, 2016 at 1:18 PM, xeon Mailinglist
>>> <xeonmailinglist@gmail.com> wrote:
>>> 
>>>> I am using Mapreduce v2.
>>>> 
>>>> On Aug 25, 2016 8:18 PM, "xeon Mailinglist" 
>>>> <xeonmailinglist@gmail.com>
>>>> wrote:
>>>> 
>>>>> I am trying to implement a mechanism in MapReduce v2 that allows 
>>>>> suspending and resuming a job. I must suspend a job when all the 
>>>>> mappers finish, and resume the job from that point after some time.
>>>>> I do this because I want to verify the integrity of the map output
>>>>> before executing the reducers.
>>>>> 
>>>>> I am looking for the class that decides when the Reduce tasks should
>>>>> start. Does anyone know where this is?
>>>>> 
>>>> 
>>> 
>> 

