hadoop-yarn-issues mailing list archives

From "Hitesh Shah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1529) Add Localization overhead metrics to NM
Date Mon, 23 Dec 2013 23:39:55 GMT

    [ https://issues.apache.org/jira/browse/YARN-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13856023#comment-13856023 ]

Hitesh Shah commented on YARN-1529:

[~jira.shegalov] Could you add more details on how users should interpret these new metrics?
Does the cache ratio account for local resource visibility, i.e. are public cache misses treated
as more important than cache misses for application visibility? I assume "LocalizationDownloadNanos"
is an average per container? How does an average help when there are numerous application
types with different numbers of resources and each container sees a different cache hit ratio? Is
this something that should be added to the container status rather than exposed as a general NM
metric? For that matter, which is the better option: tracking localization metrics at the NM
level or tracking them at a per-container/per-app level?
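
To make the concern about node-level aggregates concrete, here is a minimal sketch with made-up numbers. The container names, the hit/miss counts, and the helper `cached_ratio` are all hypothetical and not taken from the patch; returning 0.0 when there were no requests is also an assumption. It only illustrates how a single NM-wide ratio can look healthy while one container sees almost no cache hits.

```python
def cached_ratio(hits, misses):
    """Percentage of requests served from cache (0.0 if there were no requests)."""
    total = hits + misses
    return 100.0 * hits / total if total else 0.0


# Hypothetical per-container (cache hits, cache misses) on one NodeManager.
containers = {
    "container_a": (90, 10),
    "container_b": (95, 5),
    "container_c": (1, 19),
}

# Ratio as each container experiences it.
per_container = {
    name: cached_ratio(hits, misses) for name, (hits, misses) in containers.items()
}

# Ratio as a single node-level metric would report it.
total_hits = sum(hits for hits, _ in containers.values())
total_misses = sum(misses for _, misses in containers.values())
node_level = cached_ratio(total_hits, total_misses)

for name, ratio in per_container.items():
    print(f"{name}: {ratio:.1f}%")
print(f"node-level: {node_level:.1f}%")
```

With these numbers the node-level ratio is roughly 85%, while container_c is served from cache only 5% of the time, which is the kind of detail a per-container or per-app metric would surface.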

Further thoughts:
 - Shouldn't there be a metric that tracks the actual size of the local resource cache on
 - How are public/private/application caches being considered?
 - What about different resource types - file/archive/pattern? 

> Add Localization overhead metrics to NM
> ---------------------------------------
>                 Key: YARN-1529
>                 URL: https://issues.apache.org/jira/browse/YARN-1529
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: YARN-1529.v01.patch
> Users are often unaware of localization cost that their jobs incur. To measure effectiveness
> of localization caches it is necessary to expose the overhead in the form of metrics.
> We propose addition of the following metrics to NodeManagerMetrics.
> When a container is about to launch, its set of LocalResources has to be fetched from
> a central location, typically on HDFS, that results in a number of download requests for the
> files missing in caches.
> LocalizedFilesMissed: total files (requests) downloaded from DFS. Cache misses.
> LocalizedFilesCached: total localization requests that were served from local caches.
> Cache hits.
> LocalizedBytesMissed: total bytes downloaded from DFS due to cache misses.
> LocalizedBytesCached: total bytes satisfied from local caches.
> Localized(Files|Bytes)CachedRatio: percentage of localized (files|bytes) that were served
> out of cache: ratio = 100 * caches / (caches + misses)
> LocalizationDownloadNanos: total elapsed time in nanoseconds for a container to go from
> ResourceRequestTransition to LocalizedTransition
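
As a rough illustration of how the proposed counters and ratios relate, the following sketch keeps the four totals and derives both ratios from the formula quoted above. The class `LocalizationCounters`, its `record` method, and the sample request sizes are hypothetical and not part of the patch or of NodeManagerMetrics; reporting 0.0 when no requests have been seen is an assumption, since the description does not cover that case.

```python
from dataclasses import dataclass


@dataclass
class LocalizationCounters:
    """Hypothetical stand-in for the four proposed totals."""
    files_missed: int = 0   # LocalizedFilesMissed
    files_cached: int = 0   # LocalizedFilesCached
    bytes_missed: int = 0   # LocalizedBytesMissed
    bytes_cached: int = 0   # LocalizedBytesCached

    def record(self, size_bytes: int, cache_hit: bool) -> None:
        """Account for one localization request of the given size."""
        if cache_hit:
            self.files_cached += 1
            self.bytes_cached += size_bytes
        else:
            self.files_missed += 1
            self.bytes_missed += size_bytes

    @staticmethod
    def _ratio(cached: int, missed: int) -> float:
        # ratio = 100 * caches / (caches + misses); 0.0 with no requests (assumption)
        total = cached + missed
        return 100.0 * cached / total if total else 0.0

    @property
    def files_cached_ratio(self) -> float:
        return self._ratio(self.files_cached, self.files_missed)

    @property
    def bytes_cached_ratio(self) -> float:
        return self._ratio(self.bytes_cached, self.bytes_missed)


counters = LocalizationCounters()
counters.record(1000, cache_hit=True)    # small file served from cache
counters.record(3000, cache_hit=True)    # another cache hit
counters.record(4000, cache_hit=False)   # large file downloaded from DFS

print(f"files cached ratio: {counters.files_cached_ratio:.1f}%")
print(f"bytes cached ratio: {counters.bytes_cached_ratio:.1f}%")
```

In this made-up run two of three requests hit the cache, yet only half of the bytes do, which is why the file-based and byte-based ratios are worth reading side by side.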

This message was sent by Atlassian JIRA
