hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Arun C Murthy (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-728) Mumak: Map-Reduce Simulator
Date Tue, 07 Jul 2009 21:52:14 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12728382#action_12728382
] 

Arun C Murthy commented on MAPREDUCE-728:
-----------------------------------------

h2. Mumak 1.0

The goal is to build a discrete event simulator to simulate conditions under which a Hadoop
Map-Reduce Scheduler performs on a large-scale Map-Reduce cluster running a specific workload.

Mumak takes as input a reasonably large workload (e.g. a month's worth of jobs from production
cluster(s)) and simulates them in a matter of hours if not minutes on very few machines.

h4. What is it not?

It is a non-goal to simulate the actual map/reduce tasks themselves.

The scope of Version 1.0 does not include specifics of trying to simulate the actual workload
itself. It will merely take a digest of the Hadoop Map-Reduce JobHistory of all jobs in the
workload, and faithfully assume the actual run-time of individual tasks from the digest without
simulating the tasks themselves. Clearly this will not try and simulate resources and their
utilization on the actual tasktrackers, interaction between running tasks on the tasktrackers
etc. The simulation of individual tasks is left for future versions.

Some other simplifications are also made (mainly due to the lacking of such information from
the job trace):

    * No job dependency. Jobs are faithfully submitted to the cluster as defined in the job
trace.
    * No modeling of failure correlations (eg a few task attempts fail due to a node failure,
but in the simulation run, the same set of task attempts may run on different nodes). 

h4. What goes in? What comes out?

The 'workload' alluded to in the previous sections needs elaboration. The proposal is to use
the job-history for all jobs which are part of the workload. The Hadoop Map-Reduce per-job
job-history is a very detailed log of each component task with run-times, counters etc. We
can use this to generate a per-job digest with all relevant information. Thus, it is quite
sufficient and feasible to collect workload from different clusters (research, production
etc.) to be then used during simulation.

More specifically, the following is a list of details it simulates:

    * It would simulate a cluster of the same size and network topology as where the source
trace comes from. The reason for this restriction is because data locality is an important
factor to the scheduler decisions and scaling the job traces obtained from cluster A and try
to simulate it on cluster B with different diameters require a much thorough understanding.
    * It would simulate failures faithfully as recorded in the job trace. Namely, if a particular
task attempt fails in the trace, it would fail in the simulation.
    * It would replay the same number of map tasks and reduce tasks as specified in the job
digest.
    * It would use the inputsplit locations as are recorded in the job trace. 

The simulator will generate the same job-history for each of the simulated jobs. Thus we can
use the same tools for slicing and dicing the output of the simulator.

h4. Design & Architecture

Design Goals

An overarching design goal for Mumak is that we should be able to use the exact same Map-Reduce
Schedulers (listed above) as-is without any changes. This implies that we use the same interfaces
used by Hadoop Map-Reduce so that it is trivial to plug-in the Scheduler of interest.

Along the same lines it is a legitimate goal to use all relevant Hadoop Map-Reduce interfaces
between various components so that it is trivial to replace each by the appropriate Hadoop
Map-Reduce component (e.g. run the simulator in a emulation mode with real Map-Reduce clusters
etc. in future).

Architecture

Mumak consists of the following components:

    * Discrete Event Simulator Engine with an event-queue
    * Simulated JobTracker
    * Simulated Cluster (set of tasktrackers)
    * Client for handling job-submission 

Engine

The Simulator Engine is the heart of Mumak. It manages all the discrete events in virtual
time and fires the appropriate handlers (JobClient, TaskTracker) when the events occur. Typically
each event responded to by a component results in a new set of events to be fired in the future
(virtual time).

Some of the various event-types are:

    * HeartbeatEvent - An event which instructs a specific Tasktracker to send a heartbeat
to the JobTracker.
    * TaskCompletionEvent - An event which denotes the completion (success/failure) of a specific
map or reduce task which is sent to the TaskTracker.
    * JobSubmissionEvent - An event which instructs the JobClient to submit a specific job
to the JobTracker 

Simulated JobTracker

The JobTracker is driver for the Map-Reduce Scheduler. On receipt of heartbeats from various
TaskTrackers it 'tracks' progress of the current jobs and forwards the appropriate information
to the Scheduler to allow it to make the task-scheduling decisions. The simulated JobTracker
uses the virtual time to allow the scheduler to make scheduling decisions.

The JobTracker also uses the per-job digest to fill-in information about expected runtime
for each of the tasks scheduled by the Scheduler to get Mumakil to simulate run-times for
each task.

The JobTracker is purely reactive in the sense that it only reacts to hearbeats sent by TaskTrackers.
Further more it does not directly handle any events from the Engine, it only responds to the
InterTrackerProtocol.heartbeat calls as in the real-world.

Simulated Cluster

The simulated cluster consists of an appropriate number of simulated TaskTrackers which respond
to events generated by Engine. Each simulated TaskTracker maintains state about currently
running tasks (all tasks are 'running' till an appropriate TaskCompletionEvent fires) and
sends periodic status updates to the JobTracker on receipt of HeartbeatEvent.

HeartbeatEvent

When a HeartbeatEvent fires, the appropriate TaskTracker build status-reports for each of
the running tasks and sends a hearbeat to the JobTracker (InterTrackerProtocol.heartbeat).
The JobTracker updates its data-structures (JobInProgress, TaskInProgress etc.) to refect
the latest state and forwards information to the Scheduler. If any new tasks are be to scheduled
on this TaskTracker the JobTracker also fills in expected run-times for each via information
gleaned from the job-digest. The TaskTracker then processes the instructions to launch the
new tasks and responds to the Engine by inserting a set of new TaskCompletionEvents for the
new tasks into the EventQueue.

TaskCompletionEvent

When a TaskCompletionEvent fires, the appropriate TaskTracker marks the relevant task as complete
and forwards that information to the JobTracker on the next HeartbeatEvent.

Simulated JobClient

The JobClient responds to JobSubmissionEvents sent by the Engine and submits the appropriate
jobs to the JobTracker via the standard JobSubmissionProtocol.

h4. Relevant Details

Job Summary for Simulation

The following can be derived from job history file by rumen:

    * Detailed job trace with properties and counters of each task attempt (of each task of
each job in a workload).
    * Digest of jobs in a workload. From the jobs in the workload, we can derive statistical
information of tasks to build a model which can help us fabricate tasks which not even scheduled
to run (e.g. tasks of a failed job which were never run since the job was declared as FAILED
soon after submission). Along the same lines, the digest will also have statistical details
for helping modelling run-times for data-local maps, rack-local maps and off-rack maps based
on data in the job-history logs. This is necessary for simulating tasks which might be scheduled
on different nodes in the simulation run by the scheduler. 

How to deal with failure in workload?

We will try to faithfully model task failures by replaying failed task-attempts by using information
in the detailed job-traces.

We also plan to build a simple statistical model of task failures which can then be used to
simulate tasks which were never scheduled since the job failed early etc.

Simulating Reduce Tasks

In Mumak 1.0 we do not plan to simulate the running of the actual map/reduce tasks. Given
that it is not possible to simulate the implicit dependency between completion of maps, the
shuffle phase and the start of the reduce phase of the reduce tasks. Hence, we have decided
to use a special AllMapsFinished event generated by the SimulatedJobTracker to trigger the
start of the reduce-phase. For the same reasons, we have to model the total runtime of the
reduce task as the summation of the time taken for completion of all maps and the time taken
for individual task to complete the reduce-phase by itself. Thus, we are not going to try
modelling the shuffle phase accurately.

Furthermore, we will ignore map-task failures due to failed shuffles since we are not simulating
the shuffle-phase.

----

Thoughts?


> Mumak: Map-Reduce Simulator
> ---------------------------
>
>                 Key: MAPREDUCE-728
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-728
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>             Fix For: 0.21.0
>
>
> h3. Vision:
> We want to build a Simulator to simulate large-scale Hadoop clusters, applications and
workloads. This would be invaluable in furthering Hadoop by providing a tool for researchers
and developers to prototype features (e.g. pluggable block-placement for HDFS, Map-Reduce
schedulers etc.) and predict their behaviour and performance with reasonable amount of confidence,
there-by aiding rapid innovation.
> ----
> h3. First Cut: Simulator for the Map-Reduce Scheduler
> The Map-Reduce Scheduler is a fertile area of interest with at least four schedulers,
each with their own set of features, currently in existence: Default Scheduler, Capacity Scheduler,
Fairshare Scheduler & Priority Scheduler.
> Each scheduler's scheduling decisions are driven by many factors, such as fairness, capacity
guarantee, resource availability, data-locality etc.
> Given that, it is non-trivial to accurately choose a single scheduler or even a set of
desired features to predict the right scheduler (or features) for a given workload. Hence
a simulator which can predict how well a particular scheduler works for some specific workload
by quickly iterating over schedulers and/or scheduler features would be quite useful.
> So, the first cut is to implement a simulator for the Map-Reduce scheduler which take
as input a job trace derived from production workload and a cluster definition, and simulates
the execution of the jobs in as defined in the trace in this virtual cluster. As output, the
detailed job execution trace (recorded in relation to virtual simulated time) could then be
analyzed to understand various traits of individual schedulers (individual jobs turn around
time, throughput, faireness, capacity guarantee, etc). To support this, we would need a simulator
which could accurately model the conditions of the actual system which would affect a schedulers
decisions. These include very large-scale clusters (thousands of nodes), the detailed characteristics
of the workload thrown at the clusters, job or task failures, data locality, and cluster hardware
(cpu, memory, disk i/o, network i/o, network topology) etc.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message