hadoop-common-dev mailing list archives

From "Hemanth Yamijala (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-4445) Wrong number of running map/reduce tasks are displayed in queue information.
Date Mon, 10 Nov 2008 15:51:44 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-4445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646284#action_12646284 ]

Hemanth Yamijala commented on HADOOP-4445:
------------------------------------------

There are a few more issues with this information display. I had an offline discussion with
Vivek, and we came up with a few observations and ideas.

- The information is accessed by the UI update thread and the updateQSI method without proper
synchronization. This should be addressed. Currently, the QueueSchedulingInfo object is a simple
data object, and the fields of a given instance should be read together.
Access to this object is otherwise synchronized via the TaskSchedulingMgr object.
So, instead of reading the QSI fields directly, the UI should perhaps access them via the TaskSchedulingMgr.

- The capacity scheduler also updates only the reduce scheduler or the map scheduler in a
given heartbeat. So in a scenario where reduce tasks are finishing along with map tasks, since
we update the reduce scheduler in preference to the map scheduler, the information for the
map tasks could be off by more than a heartbeat. However, in a steady state, this may not
be that big an issue.
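
To illustrate the first point, here is a minimal sketch in Java of reading the queue counters as one consistent snapshot under a single lock. All class and method names below (SchedulingInfoHolder, QueueSnapshot, update, snapshot) are hypothetical and are not the actual capacity scheduler classes; the sketch only shows the idea of routing both the scheduler's writes and the UI's reads through one synchronized owner.

```java
// Hypothetical names throughout; this is not the real capacity-scheduler code.

// Immutable copy of the counters, safe to hand to the UI thread.
final class QueueSnapshot {
    final int runningMaps;
    final int runningReduces;

    QueueSnapshot(int runningMaps, int runningReduces) {
        this.runningMaps = runningMaps;
        this.runningReduces = runningReduces;
    }
}

// Stands in for the object that owns the lock (TaskSchedulingMgr in the comment above).
final class SchedulingInfoHolder {
    private int runningMaps;
    private int runningReduces;

    // Writer side: called from the scheduler's update path.
    synchronized void update(int maps, int reduces) {
        this.runningMaps = maps;
        this.runningReduces = reduces;
    }

    // Reader side: both fields are copied while holding the same lock,
    // so the UI never sees a half-updated pair.
    synchronized QueueSnapshot snapshot() {
        return new QueueSnapshot(runningMaps, runningReduces);
    }
}
```

The UI thread would call snapshot() and render from the returned object, rather than reading the mutable fields directly.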

There are some options to address this:
- We could make it explicit that the information is not synchronized with the cluster summary
(as mentioned by Vivek above, though the information should probably not be treated as off
by only a heartbeat).
- We could ensure that the information of both the map and reduce schedulers is updated at
least once every so often. For example, we could update it once every 3 heartbeats or so.
- We could also have an updater thread that runs periodically and updates the numbers every
time it runs. We could use the same code for updates as the updateQSI method itself, though
it could maintain a separate copy of the data, so as to not introduce synchronization constraints
on the scheduler.
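
The second option could be sketched roughly as follows. This is only an illustration of the bookkeeping: HeartbeatRefreshPolicy and onHeartbeat are hypothetical names, and the limit of 3 is just the figure from the example above.

```java
// Hypothetical helper; not part of the actual capacity scheduler.
// Tracks how many consecutive heartbeats each scheduler's info has gone
// without an update and signals when the skipped one must be refreshed.
final class HeartbeatRefreshPolicy {
    private final int maxSkipped;
    private int mapSkipped;
    private int reduceSkipped;

    HeartbeatRefreshPolicy(int maxSkipped) {
        if (maxSkipped < 1) {
            throw new IllegalArgumentException("maxSkipped must be >= 1");
        }
        this.maxSkipped = maxSkipped;
    }

    // Call once per heartbeat. updatedReduce is true if the reduce scheduler
    // was the one updated in this heartbeat, false if it was the map scheduler.
    // Returns true if the other scheduler's info has been skipped for
    // maxSkipped heartbeats in a row and should be refreshed now as well.
    boolean onHeartbeat(boolean updatedReduce) {
        if (updatedReduce) {
            reduceSkipped = 0;
            mapSkipped++;
            if (mapSkipped >= maxSkipped) {
                mapSkipped = 0;
                return true;
            }
        } else {
            mapSkipped = 0;
            reduceSkipped++;
            if (reduceSkipped >= maxSkipped) {
                reduceSkipped = 0;
                return true;
            }
        }
        return false;
    }
}
```

With new HeartbeatRefreshPolicy(3), three consecutive heartbeats that update only the reduce scheduler cause the third call to return true, at which point the map information would be refreshed too.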

The advantage with the last two approaches is that we could deterministically say how far
off the scheduling info would be, as compared to the cluster status. For example, if the updater
thread runs once every 30 seconds, we could say the information would be off by at most about 30 seconds.
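
A minimal sketch of the updater-thread option, assuming a standard java.util.concurrent scheduled executor and a separate published copy of the data. PeriodicQueueInfoUpdater, QueueCounts, and the Supplier passed in are hypothetical; in the real scheduler the supplier would be whatever code updateQSI uses to compute the numbers.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical immutable copy of the per-queue counters.
final class QueueCounts {
    final int runningMaps;
    final int runningReduces;

    QueueCounts(int runningMaps, int runningReduces) {
        this.runningMaps = runningMaps;
        this.runningReduces = runningReduces;
    }
}

// Hypothetical updater: recomputes the counts on a fixed period and publishes
// them through an AtomicReference, so readers never take the scheduler's lock.
final class PeriodicQueueInfoUpdater {
    private final Supplier<QueueCounts> source;
    private final AtomicReference<QueueCounts> latest;
    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "queue-info-updater");
            t.setDaemon(true);
            return t;
        });

    PeriodicQueueInfoUpdater(Supplier<QueueCounts> source) {
        this.source = source;
        this.latest = new AtomicReference<>(source.get());
    }

    // Schedules refresh() every `period` units; staleness is then bounded by
    // roughly one period (plus the time a refresh itself takes).
    void start(long period, TimeUnit unit) {
        executor.scheduleAtFixedRate(this::refresh, period, period, unit);
    }

    // Recomputes the counts and publishes the new copy.
    void refresh() {
        latest.set(source.get());
    }

    // Read side for the UI: returns the most recently published copy.
    QueueCounts latest() {
        return latest.get();
    }

    void stop() {
        executor.shutdownNow();
    }
}
```

Calling start(30, TimeUnit.SECONDS) would correspond to the 30-second example: the UI reads latest() and the displayed numbers lag the cluster status by about one period at most.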

Since in any case it appears that the information cannot be completely in sync, maybe we should
leave it simple for now, mark that the information is not synchronized with the cluster status,
and see if in steady state the information is way off. If that happens, we could fix it using
one of the methods I've stated above. Thoughts?

> Wrong number of running map/reduce tasks are displayed in queue information.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-4445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4445
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/capacity-sched
>    Affects Versions: 0.19.0
>         Environment: Hadoop r705159, Queue=default, GC=100% MapCapacity=ReduceCapacity=212
>            Reporter: Karam Singh
>            Assignee: Sreekanth Ramakrishnan
>
> Wrong number of running map/reduce tasks are displayed in queue information.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

