hadoop-mapreduce-user mailing list archives

From Arun C Murthy <...@yahoo-inc.com>
Subject Re: How do hadoop work in details
Date Thu, 13 Jan 2011 02:48:24 GMT
Felix,

  Which version of the CapacityScheduler, i.e. which version of Hadoop, are you looking at?

  As I mentioned, you'll need to wait a bit for the release I'm working
towards in order to use the max-limit feature. It's currently present in
hadoop-0.21/hadoop-0.22, but not in hadoop-0.20.

Arun

On Jan 12, 2011, at 4:55 PM, felix gao wrote:

> Arun,
>
> I went through the docs for the capacity scheduler. It seems the sum of
> all the queue capacities should be 100, but there isn't any option to
> specify the max. The options are
> mapred.capacity-scheduler.queue.<queue-name>.capacity,
> mapred.capacity-scheduler.queue.<queue-name>.supports-priority, and
> mapred.capacity-scheduler.queue.<queue-name>.minimum-user-limit-percent.
>
> Can you show me how to configure the scheduler the way you indicated?
> Also, how do I assign users to a queue? For example, all jobs submitted
> by user hadoop should use the production queue and everyone else the
> default queue.
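>
> From what I can piece together from the docs (so this is just my guess at
> the wiring, with illustrative values), queues seem to be declared in
> mapred-site.xml and submit permissions in mapred-queue-acls.xml, inside
> the <configuration> element:
>
>   <!-- mapred-site.xml: declare the queues and turn on ACLs -->
>   <property>
>     <name>mapred.queue.names</name>
>     <value>production,default</value>
>   </property>
>   <property>
>     <name>mapred.acls.enabled</name>
>     <value>true</value>
>   </property>
>
>   <!-- mapred-queue-acls.xml: only user 'hadoop' may submit to production.
>        Value format: comma-separated users, then a space, then groups. -->
>   <property>
>     <name>mapred.queue.production.acl-submit-job</name>
>     <value>hadoop</value>
>   </property>
>
> And it looks like each job has to name its queue itself via
> mapred.job.queue.name (e.g. -Dmapred.job.queue.name=production on the
> command line), falling back to default otherwise, so there seems to be no
> automatic per-user routing. Is that right?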
>
> Thanks,
>
> Felix
>
>
>
>
> On Wed, Jan 12, 2011 at 1:50 PM, Arun C Murthy <acm@yahoo-inc.com>  
> wrote:
> The approach I run for Yahoo, for pretty much the same use case, is  
> to use the CapacityScheduler and define two queues:
> production
> adhoc
>
> Let's say you want 70% of the capacity for production and the rest for
> adhoc:
> production could have 70% capacity, with a max-limit of 100;
> adhoc could have 30% capacity, with a max-limit of 70 or 80.
>
> This way adhoc jobs can take up to 70-80% of the cluster, while some
> capacity is saved for 'production' jobs at all times. You get the idea?
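>
> Concretely, in capacity-scheduler.xml that would look something like the
> following -- note that the maximum-capacity property is from the
> 0.21/0.22 line; on 0.20 you'd need the release I mention below:
>
>   <property>
>     <name>mapred.capacity-scheduler.queue.production.capacity</name>
>     <value>70</value>
>   </property>
>   <property>
>     <name>mapred.capacity-scheduler.queue.production.maximum-capacity</name>
>     <value>100</value>
>   </property>
>   <property>
>     <name>mapred.capacity-scheduler.queue.adhoc.capacity</name>
>     <value>30</value>
>   </property>
>   <property>
>     <name>mapred.capacity-scheduler.queue.adhoc.maximum-capacity</name>
>     <value>80</value>
>   </property>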
>
> I'm sure there are similar tricks for the FairScheduler; I'm just not
> familiar enough with it. I'll warn you that I only run Yahoo clusters,
> and we use the CS everywhere.
>
> One other note: I'm bang in the middle of releasing extensive
> enhancements to the CapacityScheduler via hadoop-0.20.100, or whatever we
> decide to call it:
>
> http://www.mail-archive.com/general@hadoop.apache.org/msg02670.html
>
> Arun
>
> On Jan 12, 2011, at 9:40 AM, felix gao wrote:
>
>> Arun,
>>
>> The information is very helpful. What scheduler do you suggest when we
>> have a mix of production and adhoc jobs running at the same time using
>> Pig, and we would like to guarantee the SLA for the production tasks?
>>
>> Thanks,
>>
>> Felix
>>
>> On Sun, Jan 9, 2011 at 12:35 AM, Arun C Murthy <acm@yahoo-inc.com>  
>> wrote:
>>
>> On Dec 29, 2010, at 2:43 PM, felix gao wrote:
>>
>> Hi all,
>>
>> I am trying to figure out what exactly happens inside a job.
>>
>> 1) When the jobtracker launches a task, how does it impact the
>> currently running jobs if those jobs have higher, the same, or lower
>> priorities, using the default queue?
>>
>> 2) What if a low-priority job is running and holding all the reducer
>> slots while its mappers are only halfway done, and a high-priority job
>> comes in and takes all the mapper slots, but cannot complete because all
>> the reducer slots are taken by the low-priority job?
>>
>>
>> Both 1) and 2) really depend on the scheduler you are using - Default,
>> FairScheduler or CapacityScheduler.
>>
>> With the Default scheduler, 2) is a real problem. The CapacityScheduler
>> doesn't allow priorities within the same queue for precisely this
>> reason, since it doesn't have preemption. I'm not sure if the
>> FairScheduler handles it.
>>
>>
>> 3) When are mappers allocated on the slaves, and when are reducers
>> allocated?
>>
>>
>> Usually, reduces are allocated only after a certain percentage of  
>> maps are complete (5% by default). Use  
>> mapred.reduce.slowstart.completed.maps to control this. Look at  
>> JobInProgress.java.
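>>
>> e.g., to hold the reduces back until 80% of the maps have finished
>> (0.80 is just an example value), in mapred-site.xml:
>>
>>   <property>
>>     <name>mapred.reduce.slowstart.completed.maps</name>
>>     <!-- fraction of maps that must complete before reduces are
>>          scheduled; default 0.05 -->
>>     <value>0.80</value>
>>   </property>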
>>
>>
>> 4) Do mappers pass all the data to reducers using RPC, or do they write
>> their output to HDFS and the reducers pick it up?
>>
>>
>> Maps sort/combine their output and write it to local disk. The reduces
>> then copy it over HTTP (we call this the 'shuffle' phase). The TT on
>> which a map ran serves that map's output via an embedded webserver. Look
>> at ReduceTask.java and TaskTracker.MapOutputServlet.doGet.
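>>
>> If you ever need to tune that copy, the knobs on either side of it are
>> roughly these (0.20-era names; defaults from memory, so double-check
>> your release):
>>
>>   <!-- reduce side: parallel fetch threads per reduce task (default 5) -->
>>   <property>
>>     <name>mapred.reduce.parallel.copies</name>
>>     <value>10</value>
>>   </property>
>>   <!-- TT side: worker threads in the embedded map-output webserver
>>        (default 40) -->
>>   <property>
>>     <name>tasktracker.http.threads</name>
>>     <value>60</value>
>>   </property>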
>>
>>
>> 5) Within a job, when and where does all the I/O occur?
>>
>>
>> Typically in the input to the maps, i.e. the InputFormat, and the
>> output of the reduces, i.e. the OutputFormat. Look at MapTask.java and
>> ReduceTask.java.
>>
>> hope this helps,
>> Arun
>>
>>
>>
>> I know this is a lot of low-level questions; if you can point me to the
>> right places to look, that should be enough.
>>
>> Thanks,
>>
>> Felix
>>
>>
>>
>
>

