hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-7272) Enable timeline collector fault tolerance
Date Mon, 06 Nov 2017 19:37:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240754#comment-16240754 ]

Varun Saxena edited comment on YARN-7272 at 11/6/17 7:36 PM:
-------------------------------------------------------------

Sorry for coming in a little late on this discussion, although we did discuss it during the
call.
The primary objective of fault tolerance is to ensure that entities which timeline service
v2 guarantees to write are not lost.
But writing every entity to some sort of WAL implementation would be expensive.

Now, we have two kinds of entity writes: sync and async.
Sync entities are guaranteed to be written to the backend via the collector, and an exception,
even for server-side failures, is returned, i.e. we indicate to the client that an entity could
not be written all the way to the backend so that it can retry or take some other suitable
action.
Async entities, as the name suggests, are written asynchronously. By design, they are not
guaranteed to be written to the backend. We initially cache them on the client side, either for
some time or until a sync entity arrives, combine them, and then send them to the collector.
Moreover, if an exception occurs while writing to the backend, the result is not propagated
back to the client; we only throw exceptions for client-side failures.
Async entities are also cached later in the HBase writer implementation, inside the collector,
before being flushed to HBase.
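The sync/async contract above can be sketched with a toy model (plain Java; all names here
are hypothetical illustrations, not the actual TimelineV2Client or collector code): async puts
only buffer, while a sync put flushes the combined batch through to the backend and lets any
backend exception reach the caller so it can retry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of the write contract (hypothetical, not the real collector API):
// async puts are cached; a sync put sends the combined batch and propagates
// backend failures to the caller.
class ToyCollector {
    private final List<String> buffer = new ArrayList<>();
    private final Consumer<List<String>> backend;  // simulated storage writer

    ToyCollector(Consumer<List<String>> backend) {
        this.backend = backend;
    }

    // Async: cache the entity; by design no backend error can surface here.
    void putAsync(String entity) {
        buffer.add(entity);
    }

    // Sync: combine cached async entities with this one and write through;
    // any exception from the backend reaches the caller, who can retry.
    void putSync(String entity) {
        buffer.add(entity);
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        backend.accept(batch);
    }
}
```

This mirrors why only sync callers can react to server-side failures: the async path returns
before any backend interaction happens.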

Sync writes should hence be used for publishing important events, while async writes should
be used for less important events, whose loss in case of a failure is acceptable.
For instance, publishing metric values every N seconds can be an asynchronous write, unless
the metric is very important, say, used for billing.

Keeping this in mind, a client can do synchronous writes if it cares about the durability of
its entity data.
Furthermore, asynchronous writes have other points of failure too. For instance, the collector
can crash while writing an async entity to the WAL. In this case, we currently do not propagate
the error to the timeline client, i.e. the client would not know which entity writes have failed.

Another case to handle is the backend storage being down: instead of making a sync entity
call wait, the entity could be committed to the WAL while the backend is unavailable. We can
potentially explore this option. It is possible, say, in cases where the HBase cluster runs
separately from the cluster where ATS is running.
For HBase, would HBaseAdmin#checkHBaseAvailable be sufficient to check whether HBase storage
is down?
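One way to sketch that storage-down fallback (a toy, plain-Java model; the availability probe
stands in for something like an HBaseAdmin#checkHBaseAvailable call, and none of these class
or method names come from the actual patch): while the backend is unreachable, sync writes are
committed to a local WAL instead of blocking, and once the backend is reachable again the WAL
is replayed in order before new writes go through.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Toy write path (hypothetical names): spill to a local WAL while the backend
// is down; replay the WAL in order once the backend is reachable again.
class WalFallbackWriter {
    private final Path wal;
    private final Consumer<String> backend;   // real storage write
    private final BooleanSupplier backendUp;  // availability probe, e.g. an
                                              // HBaseAdmin#checkHBaseAvailable-style check

    WalFallbackWriter(Path wal, Consumer<String> backend, BooleanSupplier backendUp) {
        this.wal = wal;
        this.backend = backend;
        this.backendUp = backendUp;
    }

    void write(String entity) throws IOException {
        if (backendUp.getAsBoolean()) {
            replay();  // drain anything spilled while storage was down
            backend.accept(entity);
        } else {
            // Backend unavailable: commit to the WAL so the sync caller
            // need not block until the backend recovers.
            Files.write(wal, List.of(entity), StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }

    private void replay() throws IOException {
        if (!Files.exists(wal)) {
            return;
        }
        for (String line : Files.readAllLines(wal)) {
            backend.accept(line);
        }
        Files.delete(wal);
    }
}
```

A real implementation would also need to handle a crash between replay and delete
(idempotent or deduplicated replay), which is exactly the collector-crash failure point
mentioned above.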

Thoughts?



> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If a NM goes down, and along with it the timeline collector aux service for a running
> yarn app, we would like that yarn app to re-establish connection with a new timeline collector.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

