hadoop-yarn-issues mailing list archives

From "Vinod Kumar Vavilapalli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
Date Fri, 13 Oct 2017 06:00:02 GMT

    [ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16203096#comment-16203096 ]

Vinod Kumar Vavilapalli commented on YARN-7272:

bq. In the 1st case, there will be outstanding unflushed entities in the app collector buffer. If
the NM is restarted, it will lose all the outstanding entities from the app collector buffer.
So, the scope of fault tolerance is restricted to NM JVM restart only.
bq. In the 2nd case, since the NM machine itself is down, all the running master containers are lost.
The RM will launch these master containers on a different machine as a second attempt.

This assumes that the collector lives inside the NM. One of the design goals for large-scale
apps is to fork the collector into its own container. When that is implemented, the above
assumptions will be invalidated. We will have new fault scenarios, e.g. the collector and AM
running on different machines, or only the collector dying and restarting on a different machine.

bq. Since it is a fresh attempt, old attempt data is not very important to the end user. Considering
this behavior, the 2nd case can be excluded when considering fault tolerance of app collectors.

If our goal is to take care of entity/event data in transit for 1 min (assuming the collector
flush interval is 1 min), we should be equally concerned about data loss whether it is due to
NM failure, machine failure, or HBase failure.

Granted, an HBase client buffer solution is faster/cheaper than a LevelDB solution, which is
in turn faster/cheaper than writing a JobHistory-like WAL to HDFS. But the last one will
cover all of those faults collectively, no?
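
To make the trade-off above concrete, here is a minimal sketch of the write-ahead idea: each
entity is appended to a log before it enters the in-memory buffer, so a restarted collector can
replay whatever had not been flushed. This is an illustration only, not the YARN implementation.
The class name WalBackedBuffer, the JSON-lines log format, and the list-like backend standing in
for HBase are all assumptions made for the example.

```python
import json
import os
import tempfile


class WalBackedBuffer:
    """Toy collector buffer (hypothetical): log each entity before buffering it."""

    def __init__(self, wal_path, backend):
        self.wal_path = wal_path
        self.backend = backend  # assumed stand-in for the real store; needs extend()
        self.pending = []
        self._replay()  # recover entities left over from a previous run
        self._wal = open(self.wal_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as f:
            for line in f:
                try:
                    self.pending.append(json.loads(line))
                except ValueError:
                    break  # torn or empty record at the tail: stop replaying

    def put(self, entity):
        # Make the record durable first, then accept it into the buffer.
        self._wal.write(json.dumps(entity) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.pending.append(entity)

    def flush(self):
        # Push buffered entities to the backend, then truncate the log.
        self.backend.extend(self.pending)
        self.pending = []
        self._wal.close()
        self._wal = open(self.wal_path, "w", encoding="utf-8")


def demo():
    wal = os.path.join(tempfile.mkdtemp(), "collector.wal")
    store = []
    buf = WalBackedBuffer(wal, store)
    buf.put({"id": "e1"})
    buf.put({"id": "e2"})
    buf._wal.close()  # simulated crash: the process dies before any flush

    restarted = WalBackedBuffer(wal, store)
    recovered = list(restarted.pending)
    restarted.flush()
    return recovered, store


if __name__ == "__main__":
    print(demo())
```

With the log on a local disk, as in this sketch, only a process restart on the same machine is
covered. Surviving a machine failure needs the log on shared storage, which is the argument for
the HDFS variant.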

> Enable timeline collector fault tolerance
> -----------------------------------------
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
> If an NM goes down and along with it the timeline collector aux service for a running
> yarn app, we would like that yarn app to re-establish connection with a new timeline collector.
