Date: Mon, 4 Feb 2013 15:32:13 +0000 (UTC)
From: "Robert Joseph Evans (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-371) Consolidate resource requests in AM-RM heartbeat

    [ https://issues.apache.org/jira/browse/YARN-371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570329#comment-13570329 ]

Robert Joseph Evans commented on YARN-371:
------------------------------------------

Tom, just like Arun said, the memory usage changes based on the size of the cluster vs. the size of the request. The current approach is on the order of the size of the cluster, whereas the proposed approach is on the order of the number of desired containers. If I have a 100-node cluster and I am requesting 100000 map tasks, the size will be O(100 nodes + X racks + 1), possibly * 2 if reducers are included in it. What is more, it is probably exactly the same size of request for 10000 or even 1000 tasks, whereas the proposed approach would grow without bound as the number of tasks increases.
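To make that comparison concrete, here is a toy sketch of the two request shapes as I understand them (standalone Java, not the real YARN ResourceRequest classes; all names and structure here are purely illustrative):

    // Illustrative only -- a toy model of the two request shapes being
    // discussed, to show how their sizes scale with the number of tasks.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class RequestSizeSketch {

      public static void main(String[] args) {
        int numNodes = 100;        // 100-node cluster
        int numRacks = 4;          // the "X racks" in the example above
        int numMapTasks = 100000;  // tasks the AM wants to run

        // Current model: requests are aggregated per location (node, rack,
        // or *), each carrying a count. The table is bounded by
        // nodes + racks + 1 no matter how many tasks are asked for.
        Map<String, Integer> current = new HashMap<>();
        for (int task = 0; task < numMapTasks; task++) {
          String node = "node" + (task % numNodes);
          String rack = "rack" + (task % numRacks);
          current.merge(node, 1, Integer::sum);
          current.merge(rack, 1, Integer::sum);
          current.merge("*", 1, Integer::sum);
        }
        System.out.println("current model entries:  " + current.size());   // <= 100 + 4 + 1

        // Proposed model (as I read it): one request per task, carrying the
        // list of locations that task would be happy with. The list grows
        // with the number of tasks.
        List<List<String>> proposed = new ArrayList<>();
        for (int task = 0; task < numMapTasks; task++) {
          String node = "node" + (task % numNodes);
          String rack = "rack" + (task % numRacks);
          proposed.add(List.of(node, rack, "*"));
        }
        System.out.println("proposed model entries: " + proposed.size());  // == 100000
      }
    }

In the toy version the current table tops out at 105 entries no matter how many tasks are requested, while the proposed list has one entry per task.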
However, I also agree with Sandy that the current state compression is lossy and as such restricts what is possible in the scheduler. I would like to understand better what the size differences would be for various requests, both in memory and also over the wire. It seems conceivable to me that if the size difference is not too big, especially over the wire, we could allow the scheduler itself to decide on its in-memory representation. This would allow the Capacity Scheduler to keep its current layout and allow others to experiment with more advanced scheduling options. Different groups could decide which scheduler best fits their needs and workload. If the size is significantly larger, I would like to see hard numbers about how much better/worse it makes specific use cases.

I am also very concerned about adding too much complexity to the scheduler. We have run into issues where the RM gets very far behind in scheduling because it is trying to do a lot already, and eventually OOMs as its event queue grows too large. I also don't want to change the scheduler protocol too much without first understanding how that new protocol would impact other potential scheduling features.

There are a number of other computing patterns that could benefit from specific scheduler support: things like gang scheduling, where you need all of the containers at once or none of them can make any progress; or where you want all of the containers to be physically close to one another because they are very I/O intensive, but you don't really care where exactly they are; or even something like HBase, where you essentially want one process on every single node with no duplicates. Do the proposed changes make these use cases trivially simple, or do they require a lot of support on the AM to implement them?

> Consolidate resource requests in AM-RM heartbeat
> ------------------------------------------------
>
>                 Key: YARN-371
>                 URL: https://issues.apache.org/jira/browse/YARN-371
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, resourcemanager, scheduler
>    Affects Versions: 2.0.2-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> Each AMRM heartbeat consists of a list of resource requests. Currently, each resource request consists of a container count, a resource vector, and a location, which may be a node, a rack, or "*". When an application wishes to request that a task run in multiple locations, it must issue a request for each location. This means that for a node-local task, it must issue three requests: one at the node level, one at the rack level, and one with * (any). These requests are not linked with each other, so when a container is allocated for one of them, the RM has no way of knowing which others to get rid of. When a node-local container is allocated, this is handled by decrementing the number of requests on that node's rack and in *. But when the scheduler allocates a task with a node-local request on its rack, the request on the node is left there. This can cause delay-scheduling to try to assign a container on a node that nobody cares about anymore.
>
> Additionally, unless I am missing something, the current model does not allow requests for containers only on a specific node or specific rack. While this is not a use case for MapReduce currently, it is conceivable that it might be something useful to support in the future, for example to schedule long-running services that persist state in a particular location, or for applications that generally care less about latency than data locality.
>
> Lastly, the ability to understand which requests are for the same task will possibly allow future schedulers to make more intelligent scheduling decisions, as well as permit a more exact understanding of request load.
>
> I would propose the tweak of allowing a single ResourceRequest to encapsulate all the location information for a task. So instead of just a single location, a ResourceRequest would contain an array of locations, including nodes that it would be happy with, racks that it would be happy with, and possibly *. Side effects of this change would be a reduction in the amount of data that needs to be transferred in a heartbeat, as well as in the RM's memory footprint, because what used to be different requests for the same task are now able to share some common data.
>
> While this change breaks compatibility, if it is going to happen, it makes sense to do it now, before YARN becomes beta.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira