hadoop-common-dev mailing list archives

From "Vivek Ratan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4445) Wrong number of running map/reduce tasks are displayed in queue information.
Date Mon, 03 Nov 2008 06:00:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12644645#action_12644645 ]

Vivek Ratan commented on HADOOP-4445:
-------------------------------------

Hemanth and I looked at what's going on here. Essentially, there are two sources of truth
regarding the number of running tasks in the system. Each JobInProgress object maintains counts
of running map and reduce tasks. These counts are incremented when a task is assigned to a
TaskTracker (TT), in obtainNewMapTask() or obtainNewReduceTask(), and they are the counts used
by the CapacityScheduler. The cluster summary, represented by the ClusterStatus object, also
contains counts of the total number of map and reduce tasks. These are incremented by the
JobTracker (JT) using the status reported by the TTs. The counts maintained by the JobInProgress
objects and by the ClusterStatus object are off by a heartbeat. The former are incremented as
soon as a task is assigned. The latter change only once the task runs on a TT and its running
status is conveyed to the JT in that TT's next heartbeat. During startup, a lot of TTs approach
the JT for tasks to run. As a result, the counts of running tasks summed across all JobInProgress
objects are much higher than the cluster count, since the cluster count is updated only when
the TTs report their status in their next heartbeat. That explains the discrepancy reported
in this Jira. In steady state, the two counts are mostly identical, or off by a little bit,
as TTs finish their tasks at different times.
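
To illustrate the lag described above, here is a toy Java model of the two counters. It is
not Hadoop code: the class, field, and method names are hypothetical, task completion is not
modelled, and the only assumption taken from the comment is that a tracker's heartbeat reports
the tasks it was already running before it receives new ones.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (not Hadoop code) of two "running task" counts that update at different times. */
public class HeartbeatLagSketch {
    /** Stand-in for the JobInProgress counts: incremented at assignment time. */
    private int assignedRunning = 0;
    /** Stand-in for the cluster summary: last running count reported by each tracker. */
    private final List<Integer> reportedPerTracker = new ArrayList<>();
    /** Tasks each tracker actually holds, including ones it has not reported yet. */
    private final List<Integer> actualPerTracker = new ArrayList<>();

    public HeartbeatLagSketch(int trackers) {
        for (int i = 0; i < trackers; i++) {
            reportedPerTracker.add(0);
            actualPerTracker.add(0);
        }
    }

    /** One heartbeat: the tracker reports its current state, then is handed new tasks. */
    public void heartbeat(int tracker, int newTasks) {
        // The status in this heartbeat predates the tasks assigned in response to it.
        reportedPerTracker.set(tracker, actualPerTracker.get(tracker));
        actualPerTracker.set(tracker, actualPerTracker.get(tracker) + newTasks);
        assignedRunning += newTasks;
    }

    /** Count as the scheduler would see it (assignment-time view). */
    public int schedulerCount() {
        return assignedRunning;
    }

    /** Count as the cluster summary would see it (heartbeat-reported view). */
    public int clusterCount() {
        int sum = 0;
        for (int n : reportedPerTracker) {
            sum += n;
        }
        return sum;
    }

    public static void main(String[] args) {
        HeartbeatLagSketch sim = new HeartbeatLagSketch(3);
        // Startup: every tracker asks for work and is assigned two tasks.
        for (int t = 0; t < 3; t++) {
            sim.heartbeat(t, 2);
        }
        System.out.println(sim.schedulerCount() + " vs " + sim.clusterCount()); // 6 vs 0
        // Next heartbeat with no new assignments: the cluster summary catches up.
        for (int t = 0; t < 3; t++) {
            sim.heartbeat(t, 0);
        }
        System.out.println(sim.schedulerCount() + " vs " + sim.clusterCount()); // 6 vs 6
    }
}
```

The gap is largest right after startup, when every tracker has just been handed tasks that
none of them has reported yet, which matches the situation described in this issue.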

This is not really a bug, as it's not clear which count is 'correct'. We're reporting from
two different sources: the cluster summary and the Scheduler (which gets its info from the
JobInProgress objects). But different numbers do get reflected in the UI. So the best fix
is probably to indicate in the Scheduler part of the UI that its computation is off from
the cluster summary by a heartbeat. Maybe a short explanation at the bottom that says something
like: "This info varies from that of the cluster summary by a heartbeat".

I don't think we should change anything in the scheduler or the cluster summary. They're both
doing the right thing in their own way. An alternative is to have the cluster summary
use the counts from the JobInProgress objects, but this is performance-intensive, and was
presumably the reason why the cluster summary maintains its own count.

You do want to leave the rest of the UI as is. The cluster summary is useful, as is the
per-queue information on running tasks (reported by the Scheduler), since it lets users know
whether the queue is running above/at/below its guaranteed capacity.

bq. Hence, the waiting counts should be removed from the scheduler information.
The scheduler maintains only a partial waiting count of map/reduce tasks. It doesn't need to know
the total number of pending tasks if this total is larger than the cluster capacity. So, for
performance reasons, it only counts up to the cluster capacity. HADOOP-4576 has been opened
for this purpose and suggests that we display pending jobs instead of pending tasks, as the
former seems more useful to users.
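
A minimal sketch of such a capped count, for illustration only. The class and method names
are hypothetical, and returning exactly the capacity once it is reached is an assumption; the
comment above says only that counting stops at the cluster capacity.

```java
/** Illustrative only (not Hadoop code): a pending-task count that stops at cluster capacity. */
public class CappedPendingCount {
    /**
     * Sums pending tasks job by job, but stops as soon as the running total
     * reaches the cluster capacity, so the result is never above the capacity.
     */
    public static int cappedPending(int[] pendingPerJob, int clusterCapacity) {
        int total = 0;
        for (int pending : pendingPerJob) {
            total += pending;
            if (total >= clusterCapacity) {
                // Remaining jobs are not examined; the true total may be larger.
                return clusterCapacity;
            }
        }
        return total;
    }
}
```

With a capacity of 212, jobs with 100, 300, and 50 pending tasks give a count of 212 even though
450 tasks are waiting, which is why the displayed waiting count can understate the real backlog.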


> Wrong number of running map/reduce tasks are displayed in queue information.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-4445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4445
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Hadoop r705159, Queue=default, GC=100% MapCapacity=ReduceCapacity=212
>            Reporter: Karam Singh
>            Assignee: Sreekanth Ramakrishnan
>
> Wrong number of running map/reduce tasks are displayed in queue information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

