hadoop-common-dev mailing list archives

From "Mahadev konar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-719) Integration of Hadoop with batch schedulers
Date Tue, 14 Nov 2006 23:24:38 GMT
    [ http://issues.apache.org/jira/browse/HADOOP-719?page=comments#action_12449858 ] 
Mahadev konar commented on HADOOP-719:

The three components of HOD are:
bin/hod : This is the part of HOD that talks to the underlying scheduler/resource manager
(condor, torque etc.) and asks for machines. This is the main component of HOD: it allocates
the specified number of nodes and then monitors the HOD instance for any updates/failures. A
HOD instance is a map reduce instance of Hadoop.
bin/hod-tt : This is the script that is run on a node to invoke a tasktracker instance.
bin/hod-jt : This is the script that invokes the jobtracker instance on a given node.

The basic idea being:
--- The user invokes bin/hod with all the required arguments (including the job.jar and hadoop
--- bin/hod gets the required number of nodes from the underlying resource manager
     --- invokes bin/hod-jt on the machine that will be running the jobtracker
     --- invokes bin/hod-tt on the machines that will be running the task trackers
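The flow above can be sketched roughly as follows. This is a hypothetical illustration, not actual HOD code: `run_remote`, `FakeResourceManager`, and `run_hod` are all invented names standing in for the real scheduler interaction.

```python
def run_remote(node, script):
    """Stand-in for launching a script on a remote node (e.g. via ssh)."""
    return f"{node}: {script}"

class FakeResourceManager:
    """Stand-in for a torque/condor wrapper; hands out made-up node names."""
    def allocate(self, n):
        return [f"node{i}" for i in range(n)]

def run_hod(min_nodes, resource_manager):
    # 1. Ask the underlying resource manager for the requested nodes.
    nodes = resource_manager.allocate(min_nodes)
    # 2. bin/hod-jt goes to the first node: it runs the jobtracker.
    jobtracker, *workers = nodes
    run_remote(jobtracker, "bin/hod-jt")
    # 3. bin/hod-tt goes to every remaining node: each runs a tasktracker.
    for node in workers:
        run_remote(node, "bin/hod-tt")
    return jobtracker, workers
```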

Here is a design document for HOD. 
----- Design Document for bin/hod
The two key design points for bin/hod are:

 1. an abstract factory that hides batch-scheduler-specific behaviour behind a common API
 2. an event message loop that abstracts the post-startup interaction between hadoop and the batch scheduler;
it handles grow/shrink, termination, etc.

The command line parameters for bin/hod are:

  -jar <map-reduce JAR>

  -resource-manager torque, condor, ...

  -min-node <n>
   the minimum number of nodes required for this map reduce instance

  -max-node <m>
   the maximum number of nodes required for this map reduce instance 

  -qos <qos-level>

  -account <project-account-name>

  -wall-time <t>
  -S or -to-scheduler k=v,k1=v1,... 
  -H or -to-hadoop k=v,... 
  -J or -to-jobclient k=v,... 
  and system-related ones like workdir, namenode addr, resource manager package dir, hadoop
dir or tarball, etc.
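The -S/-H/-J options each take a comma-separated k=v list. A minimal sketch of how such an argument might be parsed (the function name is invented, not from HOD):

```python
def parse_kv_list(arg):
    """Parse a "k=v,k1=v1,..." argument (the -S/-H/-J form) into a dict."""
    pairs = {}
    for item in arg.split(","):
        if not item:
            continue  # tolerate trailing or doubled commas
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs
```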

The classes that bin/hod will contain are the following:

  ClusterConfig: This is an abstract factory for all the cluster configs.
  It can be instantiated to scheduler-specific subclasses -- ... etc.
  These subclasses have information on what the cluster specific commands/configurations are;
  the submit() method would map to condor_submit in the case of Condor and qsub in the case of torque.
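The abstract-factory mapping described here (submit() resolving to condor_submit or qsub) can be sketched as below. The subclass names and the create_instance lookup are hypothetical illustrations of the pattern, not the real HOD classes.

```python
class ClusterConfig:
    """Abstract base: scheduler-specific subclasses supply the submit command."""
    submit_cmd = None

    @staticmethod
    def create_instance(name):
        # Pick the subclass by resource-manager name, as -resource-manager would.
        return {"condor": CondorClusterConfig, "torque": TorqueClusterConfig}[name]()

    def submit(self):
        # A real implementation would shell out; here we just return the command.
        return self.submit_cmd

class CondorClusterConfig(ClusterConfig):
    submit_cmd = "condor_submit"

class TorqueClusterConfig(ClusterConfig):
    submit_cmd = "qsub"
```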

   ClusterHead: This abstract class maps to the JobTracker class. A proxy for the JobTracker.

   ClusterBody: This abstract class maps to the tasktrackers/worker nodes.

The above two classes can have batch-scheduler-specific ClusterHead/ClusterBody as their subclasses.

There are two different classes for the TaskTrackers/JobTracker since the reactions to events
raised at the JobTracker and at the TaskTrackers are different.

ClusterMonitor: This class monitors the map reduce instance, including the JobTracker/TaskTrackers/JobClient;
it polls them for events and handles all those events. This will also
be extended to a batch-scheduler-specific cluster monitor and would implement common functionality
(like querying the jobtracker/tasktrackers).


JobConfig: This class handles the basic arguments of map/reduce jobs: the input dir/output dir and the number
of maps/reduces. Later we could use the input dir information to ask for nodes that are local/rack-local
to the input files for the maps.

The pseudo code for bin/hod would look like --

   Config cfg = new Config(args) // this is the main command line parser
   JobConfig jc = new JobConfig(cfg) // uses the following args -- input dir, output dir,
number of maps, number of reduces

   // The cluster config takes a JobConfig argument to create a scheduler config that can specify
what kind of nodes it wants (nodes local to some rack?)
   ClusterConfig cc = ClusterConfig.createInstance(cfg, jc) // uses any scheduler specific
arguments from cfg

   // The cluster head is the jobtracker and is always launched first
   ClusterHead ch = ClusterHead.createInstance(cfg, cc) // gets the hadoop specific args from cfg

   // The ClusterBody takes the cluster head as an input parameter
   ClusterBody cb = ClusterBody.createInstance(cfg, ch, cc)

   JobClient jobclient = new JobClient(jc)
   ClusterMonitor cm = new ClusterMonitor(ch, cb, jobclient)
   while (e = cm.newEvent()) {
      // the events might be of the type -- jobtracker failed, tasktracker failed,
      // only a few reduces are left with most of the nodes lying idle, etc.
      handle(e)
   }
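The ClusterMonitor event loop above can be sketched as an event-to-action dispatch. Both the event names and the recovery actions here are invented for illustration; the design document does not specify them.

```python
def handle_events(events):
    """Map each monitor event to a recovery action (all names hypothetical)."""
    actions = []
    for event in events:
        if event == "jobtracker_failed":
            actions.append("restart_jobtracker")   # the whole instance depends on it
        elif event == "tasktracker_failed":
            actions.append("replace_tasktracker")  # ask the scheduler for a new node
        elif event == "nodes_idle":
            actions.append("release_idle_nodes")   # shrink: few reduces left, nodes idle
    return actions
```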


> Integration of Hadoop with batch schedulers
> -------------------------------------------
>                 Key: HADOOP-719
>                 URL: http://issues.apache.org/jira/browse/HADOOP-719
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/streaming
>            Reporter: Mahadev konar
>         Assigned To: Mahadev konar
> Hadoop On Demand (HOD) is an integration of Hadoop with batch schedulers like Condor/torque/Sun
Grid etc. Hadoop On Demand, or HOD hereafter, is a system that populates a Hadoop instance using
a shared batch scheduler. HOD will find a requested number of nodes and start up Hadoop daemons
on them. Users' map reduce jobs can then run on the hadoop instance. After the job is done,
HOD gives back the nodes to the shared batch scheduler. A group of users will use HOD to
acquire Hadoop instances of varying sizes, and the batch scheduler will schedule requests in
a way that important jobs gain more importance/resources and finish fast. Here is a list
of requirements for HOD and batch schedulers:
> Key Requirements :
> --- Should allocate the specified minimum number of nodes for a job
>    Many batch jobs can finish in time only when enough resources are allocated. Therefore the
batch scheduler should allocate the asked-for number of nodes for a given job when the job starts.
This is a simple form of what's known as gang scheduling.
>   Often the minimum nodes are not available right away, especially if the job asked for
a large number. The batch scheduler should support advance reservation for important jobs
so that the wait time can be determined. In advance reservation, a reservation is created
at the earliest future point when the preoccupied nodes become available. When nodes are currently
idle but booked by future reservations, the batch scheduler may give them to other jobs to
increase system utilization, but only when doing so does not delay existing reservations.
> --- Run short urgent jobs without causing too much loss to long jobs. In particular, should
not kill the job tracker of a long job.
>   Some jobs, mostly short ones, are time sensitive and need urgent treatment. Often, a
large portion of the cluster nodes will be occupied by long running jobs. The batch scheduler should
be able to preempt long jobs and run urgent jobs. Then the urgent jobs will finish quickly and
the long jobs can re-gain the nodes afterward.
> When preemption happens, HOD should minimize the loss to long jobs. In particular, it should
not kill the job tracker of a long job.
> --- Be able to dial up, at run time, the share of resources for more important projects.
>   Viewed at a high level, a given cluster is shared by multiple projects. A project consists
of a number of jobs submitted by a group of users. The batch scheduler should allow important projects
to have more resources. This should be tunable at run time, as which projects are deemed more important
may change over time.
> --- Prevent malicious abuse of the system.
>   A shared cluster environment can be put in jeopardy if malicious or erroneous job code is allowed to:
>  -- hold unneeded resources for a long period
>  -- use privileges for unworthy work
>   Such abuse can easily cause under-utilization or starvation of other jobs. The batch scheduler
should allow setting up policies for preventing resource abuse by:
>  -- limiting privileges to legitimate uses asking for a proper amount
>  -- throttling peak use of resources per player
>  -- monitoring and reducing starvation
> --- The behavior should be simple and predictable.
>    When the status of the system is queried, we should be able to determine what factors
caused it to reach the current status and what the future behavior could be with or without our
tuning of the system.
> --- Be portable to major resource managers.
>    The HOD design should be portable so that in the future we are able to plug in other resource
managers.
> Some of the key requirements are implemented by the batch schedulers. The others need
to be implemented by HOD.

This message is automatically generated by JIRA.
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

