hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: How do hadoop work in details
Date Wed, 12 Jan 2011 21:50:02 GMT
The approach I run at Yahoo, for pretty much the same use case, is to  
use the CapacityScheduler and define two queues:
production
adhoc

Let's say you want 70% of the capacity for production and the rest  
for adhoc:
production could have 70% capacity, max-limit of 100
adhoc could have 30% capacity, max-limit of 70/80.

This way adhoc jobs can take up to 70-80% of the cluster, but some  
capacity is saved for 'production' jobs at all times. You get the idea?
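
For reference, the two-queue setup above might be sketched roughly like this in capacity-scheduler.xml. This is a sketch assuming the 0.20-era property names (`mapred.capacity-scheduler.queue.<name>.capacity` and `.maximum-capacity`), not a tested config; the queue names themselves also have to be listed under `mapred.queue.names` in mapred-site.xml:

```xml
<!-- capacity-scheduler.xml: sketch of the two-queue setup above.
     Queue names ('production', 'adhoc') must also be listed under
     mapred.queue.names in mapred-site.xml. -->
<property>
  <name>mapred.capacity-scheduler.queue.production.capacity</name>
  <value>70</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.production.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.adhoc.capacity</name>
  <value>30</value>
</property>
<property>
  <name>mapred.capacity-scheduler.queue.adhoc.maximum-capacity</name>
  <value>80</value>
</property>
```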

I'm sure there are similar tricks for the FairScheduler; I'm just not  
familiar enough with it. I'll warn you that I only run Yahoo clusters,  
and we use the CapacityScheduler everywhere.

One other note: I'm bang in the middle of releasing extensive  
enhancements to the CapacityScheduler via hadoop-0.20.100 or whatever  
we decide to call it:

http://www.mail-archive.com/general@hadoop.apache.org/msg02670.html

Arun

On Jan 12, 2011, at 9:40 AM, felix gao wrote:

> Arun,
>
> The information is very helpful.  What scheduler do you suggest  
> when we have a mix of production and adhoc jobs running at the  
> same time using Pig, and we would like to guarantee the SLA for  
> production tasks?
>
> Thanks,
>
> Felix
>
> On Sun, Jan 9, 2011 at 12:35 AM, Arun C Murthy <acm@yahoo-inc.com>  
> wrote:
>
> On Dec 29, 2010, at 2:43 PM, felix gao wrote:
>
> Hi all,
>
> I am trying to figure out what exactly happens inside a job.
>
> 1) When the jobtracker launches a task to be run, how does it impact  
> the currently running jobs if the currently running jobs have  
> higher, the same, or lower priorities, using the default queue?
>
> 2) What if a low-priority job is running that is holding all the  
> reducer slots while its mappers are halfway done, and a high-priority  
> job comes in and takes all the mapper slots, but cannot complete  
> because all the reducer slots are taken by the low-priority job?
>
>
> Both 1) and 2) really depend on the scheduler you are using:  
> Default, FairScheduler or CapacityScheduler.
>
> With the Default scheduler, 2) is a real problem. The  
> CapacityScheduler doesn't allow priorities within the same queue for  
> precisely that reason, since it doesn't have preemption. I'm not  
> sure if the FairScheduler handles it.
>
>
> 3) When are mappers allocated on the slaves, and when are reducers  
> allocated?
>
>
> Usually, reduces are allocated only after a certain percentage of  
> maps are complete (5% by default). Use  
> mapred.reduce.slowstart.completed.maps to control this. Look at  
> JobInProgress.java.
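
As a concrete illustration of the knob mentioned above (the 0.50 here is an arbitrary example value, not a recommendation), delaying reducer allocation until half the maps have finished would look like this in mapred-site.xml:

```xml
<!-- mapred-site.xml: don't schedule reducers until 50% of the maps
     have completed (the default is 0.05, i.e. 5%). -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.50</value>
</property>
```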
>
>
> 4) Do mappers pass all the data to reducers using RPC, or do they  
> write their output to HDFS and the reducers pick it up?
>
>
> Maps sort/combine their output and write it to local disk. The  
> reduces then copy it over HTTP (we call this the 'shuffle' phase).  
> The TT on which the map ran serves the map's output via an embedded  
> webserver. Look at ReduceTask.java and  
> TaskTracker.MapOutputServlet.doGet.
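
The piece that decides which reducer fetches which part of a map's output is the partitioner. The default HashPartitioner boils down to the logic below; this is a simplified, standalone sketch of that logic, not Hadoop's actual class:

```java
// Simplified sketch of Hadoop's default HashPartitioner logic: each
// map output record is assigned to one of numReduceTasks partitions,
// and the reducer owning that partition later pulls it over HTTP
// during the shuffle.
public class PartitionSketch {
    // Mask off the sign bit so the modulo result is never negative,
    // even for keys whose hashCode() is negative.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> reducer "
                + getPartition(key, reducers));
        }
    }
}
```

Because the partition depends only on the key's hash, all values for the same key land on the same reducer, which is what makes the reduce-side grouping work.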
>
>
> 5) Within a job, when and where does all the I/O occur?
>
>
> Typically the input to the maps (i.e. InputFormat) and the output of  
> the reduces (i.e. OutputFormat). Look at MapTask.java and  
> ReduceTask.java.
>
> hope this helps,
> Arun
>
>
>
> I know this seems to be a lot of low-level questions; if you can  
> point me to the right place to look, it should be enough.
>
> Thanks,
>
> Felix
>
>
>

