hadoop-yarn-issues mailing list archives

From "Hu Ziqian (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-8234) Improve RM system metrics publisher's performance by pushing events to timeline server in batch
Date Sun, 17 Jun 2018 10:05:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-8234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hu Ziqian updated YARN-8234:
    Attachment: YARN-8234.001.patch

> Improve RM system metrics publisher's performance by pushing events to timeline server
in batch
> -----------------------------------------------------------------------------------------------
>                 Key: YARN-8234
>                 URL: https://issues.apache.org/jira/browse/YARN-8234
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager, timelineserver
>    Affects Versions: 2.8.3
>            Reporter: Hu Ziqian
>            Assignee: Hu Ziqian
>            Priority: Critical
>         Attachments: YARN-8234-branch-, YARN-8234-branch-,
YARN-8234-branch-, YARN-8234.001.patch
> When the system metrics publisher is enabled, the RM pushes events to the timeline server via
its RESTful API. Under heavy cluster load, many events are sent to the timeline server and the
timeline server's event handler thread gets locked. YARN-7266 describes this problem in
detail. Because of the lock, the timeline server can't receive events as fast as the RM generates
them, and many timeline events accumulate in the RM's memory. Eventually those events consume
all of the RM's memory, and the RM either starts a full GC (which causes a JVM stop-the-world pause
and a timeout between the RM and ZooKeeper) or hits an OOM.
> The main problem is that the timeline server can't receive the RM's events as fast
as they are generated. Currently, the RM system metrics publisher puts only one event in each request,
so most of the time is spent handling HTTP headers and other network connection overhead on the
timeline server side. Only a small fraction of the time is spent processing the timeline event itself,
which is the truly valuable work.
> In this issue, we add a buffer to the system metrics publisher and let the publisher send events
to the timeline server in batches, one batch per request. With the batch size set to 1000, our
experiment showed a 100x improvement in the rate at which the timeline server receives events. We have
deployed this feature in our production environment, which accepts 20000 apps in one hour, and it works fine.
> We add the following configuration options:
>  * yarn.resourcemanager.system-metrics-publisher.batch-size: the number of events the system metrics
publisher sends in one request. The default value is 1000.
>  * yarn.resourcemanager.system-metrics-publisher.buffer-size: the size of the event buffer
in the system metrics publisher.
>  * yarn.resourcemanager.system-metrics-publisher.interval-seconds: when batch
publishing is enabled, the publisher must not wait for a batch to fill up and hold events
in the buffer for a long time, so we add another thread that sends the events in the buffer periodically.
This option sets the interval of that periodic sending thread. The default value is 60s.
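
The interaction of the three options above can be illustrated with a minimal Java sketch. It is not the code in YARN-8234.001.patch: the class BatchingPublisher, its methods, and the sender callback are hypothetical names chosen for illustration, and the constructor arguments merely mirror buffer-size, batch-size, and interval-seconds. Events are queued in a bounded buffer, a batch is sent as soon as batch-size events are waiting, and a timer thread sends whatever is left every interval-seconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch of batched publishing; not the actual YARN patch. */
public class BatchingPublisher<E> implements AutoCloseable {
    private final BlockingQueue<E> buffer;      // bounded, like buffer-size
    private final int batchSize;                // like batch-size
    private final Consumer<List<E>> sender;     // stands in for one HTTP request
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    public BatchingPublisher(int bufferSize, int batchSize,
                             long intervalSeconds, Consumer<List<E>> sender) {
        if (bufferSize < batchSize) {
            throw new IllegalArgumentException("bufferSize must be >= batchSize");
        }
        this.buffer = new ArrayBlockingQueue<>(bufferSize);
        this.batchSize = batchSize;
        this.sender = sender;
        // Periodic flush, so a partially filled batch never waits indefinitely.
        timer.scheduleWithFixedDelay(() -> {
            try {
                flush();
            } catch (RuntimeException e) {
                // Swallow so that one failed send does not cancel the timer.
            }
        }, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }

    /** Queues one event; returns false if the buffer is full. */
    public boolean publish(E event) {
        boolean accepted = buffer.offer(event);
        if (buffer.size() >= batchSize) {
            flush();
        }
        return accepted;
    }

    /** Sends everything currently buffered, at most batchSize events per call to sender. */
    public synchronized void flush() {
        List<E> batch = new ArrayList<>();
        while (buffer.drainTo(batch, batchSize) > 0) {
            sender.accept(batch);
            batch = new ArrayList<>();
        }
    }

    /** Stops the timer thread and sends any remaining events. */
    @Override
    public void close() {
        timer.shutdownNow();
        flush();
    }
}
```

In this sketch a full buffer makes publish return false, so the caller decides whether to drop or retry the event; the issue description does not say what the real publisher does in that case.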

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
