hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4712) CPU Usage Metric is not captured properly in YARN-2928
Date Sat, 05 Mar 2016 01:52:40 GMT

    [ https://issues.apache.org/jira/browse/YARN-4712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181424#comment-15181424 ]

Sangjin Lee commented on YARN-4712:
-----------------------------------

I agree that we should take care of the UNAVAILABLE metrics via YARN-4308. My position is
that we should skip reporting the value rather than reporting 0.
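
To illustrate what "skip rather than report 0" could look like, here is a minimal sketch. It is not the actual {{ContainersMonitor}} or {{NMTimelinePublisher}} code: the class {{CpuUsageSketch}} and the method {{totalCoresPercentage}} are hypothetical names, and the constant simply mirrors the -1 value of {{ResourceCalculatorProcessTree.UNAVAILABLE}} described in the issue. The point is that the sentinel has to be checked before dividing by the number of processors, because the result of the division is no longer -1.

```java
import java.util.OptionalDouble;

// Hypothetical helper, not part of YARN: checks the sentinel before dividing
// so that an unavailable reading is skipped instead of being reported.
final class CpuUsageSketch {
    // Assumption: mirrors ResourceCalculatorProcessTree.UNAVAILABLE (-1) from the issue text.
    static final int UNAVAILABLE = -1;

    // Returns an empty value when the reading is unavailable, so the caller
    // can skip publishing the metric rather than publish 0 or a bogus number.
    static OptionalDouble totalCoresPercentage(float cpuUsagePercentPerCore, int numProcessors) {
        if (cpuUsagePercentPerCore == UNAVAILABLE || numProcessors <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of(cpuUsagePercentPerCore / numProcessors);
    }

    public static void main(String[] args) {
        // 300% per core on a 6-core node is 50% of the whole node.
        System.out.println(totalCoresPercentage(300f, 6));
        // An unavailable reading yields no value at all.
        System.out.println(totalCoresPercentage(UNAVAILABLE, 6));
    }
}
```

A caller would publish the metric only when a value is present and otherwise emit nothing for that interval.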

I know I might be opening a can of worms, but I'd like to raise a couple of points as they
are closely related to this.

First, what should we report via {{NMTimelinePublisher}}? There are 2 choices: {{cpuUsagePercentPerCore}}
(300% in the example mentioned in the comment) and {{cpuUsageTotalCoresPercentage}} (50% in
the same example). I see that we're storing {{cpuUsageTotalCoresPercentage}}. I wonder if
that is the best choice here.

For example, consider a cluster with workers with substantially different capacity (number
of cores). If we used the latter and tried to aggregate them later for the application or
the flow, this would lead to a highly misleading sum. 50% of a 6-core node is very different
from 50% of a 24-core node.

Most of YARN's CPU accounting is based on cores rather than nodes/machines. IMO {{cpuUsagePercentPerCore}}
would be a better value to emit. Thoughts?
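
The arithmetic behind this concern can be sketched as follows. The class and method names are made up for illustration; the 6-core and 24-core figures are the ones used above, and 100% per core is taken to mean one fully used core.

```java
// Hypothetical illustration, not YARN code: compares summing node-relative
// percentages with summing per-core percentages across unequal nodes.
final class CpuAggregationSketch {
    // Converts a node-relative percentage into a per-core percentage,
    // where 100% corresponds to one fully used core.
    static double perCorePercent(double totalCoresPercent, int numCores) {
        return totalCoresPercent * numCores;
    }

    public static void main(String[] args) {
        double smallNode = perCorePercent(50.0, 6);   // 300% per core, i.e. 3 cores
        double largeNode = perCorePercent(50.0, 24);  // 1200% per core, i.e. 12 cores

        // Summing the node-relative values hides the difference in capacity.
        System.out.println("sum of node-relative values: " + (50.0 + 50.0));
        // Summing the per-core values reflects the 15 cores actually in use.
        System.out.println("sum of per-core values: " + (smallNode + largeNode));
    }
}
```

Both containers report 50% in node-relative terms, yet one uses 3 cores and the other 12. Only the per-core sum (1500%) carries that information up to the application or flow level.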

The second point is the following line in the existing code:
{code}
        cpuMetric.setId(ContainerMetric.CPU.toString() + pId);
{code}

I vaguely remember reading this line and being puzzled. Why are we appending the process id
to the metric id? Doesn't this cause issues when we do the aggregation? For example, suppose
we have a container #1 (process id = 1234) on some machine whose CPU usage is 10%, and container
#2 (process id = 5678) on another machine whose CPU usage is 20%. The object model will be

{noformat}
(container #1) -> (metric) -> ("CPU1234" => 10)
(container #2) -> (metric) -> ("CPU5678" => 20)
{noformat}

But we want to add them for the parent application. It would be really awkward to add these
metrics with different keys. Why is process id needed here in the first place?
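
Here is a small sketch of the aggregation problem. It assumes a simplified model in which each container carries a map from metric id to value and the application total is computed by summing values with the same id. The class {{MetricIdSketch}} and its {{aggregate}} method are hypothetical and are not the timeline service's real aggregation code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not YARN code: sums container metrics into an
// application-level map, keyed by metric id.
final class MetricIdSketch {
    static Map<String, Long> aggregate(List<Map<String, Long>> containerMetrics) {
        Map<String, Long> app = new LinkedHashMap<>();
        for (Map<String, Long> metrics : containerMetrics) {
            // Values are added together only when their ids match exactly.
            metrics.forEach((id, value) -> app.merge(id, value, Long::sum));
        }
        return app;
    }

    public static void main(String[] args) {
        // Ids with the process id appended never collide, so nothing is summed.
        System.out.println(aggregate(List.of(Map.of("CPU1234", 10L), Map.of("CPU5678", 20L))));
        // A shared id lets the two container values add up for the application.
        System.out.println(aggregate(List.of(Map.of("CPU", 10L), Map.of("CPU", 20L))));
    }
}
```

With the process id in the key, the application ends up with two unrelated entries ({{CPU1234}} and {{CPU5678}}). With a plain {{CPU}} key it gets the single summed value of 30.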

> CPU Usage Metric is not captured properly in YARN-2928
> ------------------------------------------------------
>
>                 Key: YARN-4712
>                 URL: https://issues.apache.org/jira/browse/YARN-4712
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Naganarasimha G R
>            Assignee: Naganarasimha G R
>              Labels: yarn-2928-1st-milestone
>         Attachments: YARN-4712-YARN-2928.v1.001.patch, YARN-4712-YARN-2928.v1.002.patch, YARN-4712-YARN-2928.v1.003.patch
>
>
> There are 2 issues with CPU usage collection:
> * I observed that the CPU usage returned by {{pTree.getCpuUsagePercent()}} is often
> ResourceCalculatorProcessTree.UNAVAILABLE (i.e. -1), but ContainersMonitor still does the calculation,
> i.e. {{cpuUsageTotalCoresPercentage = cpuUsagePercentPerCore / resourceCalculatorPlugin.getNumProcessors()}}.
> As a result, the UNAVAILABLE check in {{NMTimelinePublisher.reportContainerResourceUsage}}
> is never triggered, so proper checks need to be added.
> * {{EntityColumnPrefix.METRIC}} always uses LongConverter, but ContainersMonitor publishes
> decimal values for the CPU usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
