hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7272) Enable timeline collector fault tolerance
Date Mon, 06 Nov 2017 08:22:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240021#comment-16240021 ]

Rohith Sharma K S commented on YARN-7272:
-----------------------------------------

Thanks [~vrushalic] for putting up the summary.
Adding to the above points, some of the pros and cons discussed in the call are:
Pros :
# An additional WAL layer would help recover async entities. This ensures that no entities
sent by TimelineV2Clients to collectors are lost.
This JIRA primarily tries to address two kinds of downtime: the collector JVM going down,
or the collector machine itself going down.
# The WAL layer is an independent service that runs on the collector. It is not tightly bound
to the back-end storage, which enables recovery of async entities regardless of which back-end
storage is plugged in.
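
To make the first point concrete, here is a minimal sketch of the write-ahead idea, not the
actual patch. The class and method names (CollectorWalSketch, putEntityAsync, recover) are
hypothetical, entities are assumed to be single-line strings, and a local file stands in for
the WAL file that would live on HDFS.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a local file stands in for the WAL file on HDFS.
class CollectorWalSketch {
    private final Path walFile;

    CollectorWalSketch(Path walFile) {
        this.walFile = walFile;
    }

    // Append the entity to the WAL before the async request is acknowledged.
    // The method is synchronized so that concurrent client requests do not
    // interleave their records. Assumption: an entity is one line of text.
    synchronized void putEntityAsync(String entity) {
        try {
            Files.write(walFile,
                (entity + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // After a collector restart, read back every entity that was logged so it
    // can be re-sent to whichever back-end storage is plugged in.
    static List<String> recover(Path walFile) {
        try {
            if (!Files.exists(walFile)) {
                return new ArrayList<>();
            }
            return new ArrayList<>(Files.readAllLines(walFile, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience helper for trying the sketch out with a temporary file.
    static Path newTempWal() {
        try {
            return Files.createTempFile("collector-wal", ".log");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The sketch does not force the write to durable storage. A real WAL on HDFS would presumably
also need to flush or sync the output stream before acknowledging the request.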

Cons :
# Ensuring that all async entities are written into the WAL would be a costly operation,
because requests from multiple clients would be waiting on the write to HDFS. This causes
contention among requests writing into the WAL to ensure atomicity, which slows down request
processing from TimelineClients.
# Storing entities in the WAL in addition to the back-end storage duplicates effort!
# Since we keep only the last 1 minute of data, every collector flush also requires renaming
the file in HDFS. This operation leads to entity files being spread across the cluster, which
makes writes slower, since a local write is always faster than a remote write! We probably
need to think about how to work with a single file for the collector's whole lifetime while
keeping track of only the last 1 minute of entities. I see a *truncate* API in HDFS; we need
to check what this API does.
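
The "keep only the last 1 minute" requirement from the third point can be sketched
independently of the file layout. The class below is a hypothetical in-memory model, not
collector code: RetentionWindowSketch, append and onFlush are made-up names, and the 60 second
window is taken from the "last 1 minute" figure above. Note that, as far as I understand it,
truncating a file shortens it from the end, so it would not by itself drop the oldest records
at the head of a single long-lived file.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical in-memory model of a WAL that retains only recent entities.
class RetentionWindowSketch {
    // Assumption: "last 1 minute" is a 60 second sliding window.
    static final long WINDOW_MS = 60_000L;

    private static final class Entry {
        final long timestampMs;
        final String entity;

        Entry(long timestampMs, String entity) {
            this.timestampMs = timestampMs;
            this.entity = entity;
        }
    }

    // Oldest entries sit at the head, newest at the tail.
    private final Deque<Entry> window = new ArrayDeque<>();

    // Record an entity together with the time it was received.
    synchronized void append(long nowMs, String entity) {
        window.addLast(new Entry(nowMs, entity));
    }

    // Called on each collector flush: drop entries older than the window
    // and return how many were dropped.
    synchronized int onFlush(long nowMs) {
        long cutoff = nowMs - WINDOW_MS;
        int dropped = 0;
        while (!window.isEmpty() && window.peekFirst().timestampMs < cutoff) {
            window.removeFirst();
            dropped++;
        }
        return dropped;
    }

    // Entities that would still need to be replayed after a crash.
    synchronized List<String> retained() {
        List<String> out = new ArrayList<>();
        for (Entry e : window) {
            out.add(e.entity);
        }
        return out;
    }
}
```

Because expired records are always at the head, a single-file design would need some way to
discard a prefix of the file, or to rewrite the retained tail, instead of renaming on every flush.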

I think _if the cost of flushing into the WAL for every async API call is greater than or equal
to the cost of flushing into HBase (as of now), then it is better to flush into HBase directly_.
But this approach is tightly coupled to the cost of the back-end storage!
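
The comparison above could be checked with a small measurement harness along these lines. This
is only a sketch: FlushCostSketch and its methods are hypothetical names, and the two flush
actions (one writing to the WAL, one writing to HBase) would have to be supplied by the caller.

```java
// Hypothetical helper for comparing the two flush paths.
class FlushCostSketch {
    // Average wall-clock cost, in nanoseconds, of running the flush action `runs` times.
    static long averageCostNanos(Runnable flush, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            flush.run();
        }
        return (System.nanoTime() - start) / runs;
    }

    // The rule stated above: if the WAL flush costs at least as much as the
    // back-end flush, skip the WAL and flush to the back end directly.
    static boolean preferDirectBackendFlush(long walCostNanos, long backendCostNanos) {
        return walCostNanos >= backendCostNanos;
    }
}
```

The outcome depends on which back-end storage is measured, which is the coupling noted above.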

> Enable timeline collector fault tolerance
> -----------------------------------------
>
>                 Key: YARN-7272
>                 URL: https://issues.apache.org/jira/browse/YARN-7272
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineclient, timelinereader, timelineserver
>            Reporter: Vrushali C
>            Assignee: Rohith Sharma K S
>         Attachments: YARN-7272-wip.patch
>
>
> If an NM goes down and along with it the timeline collector aux service for a running
> yarn app, we would like that yarn app to re-establish connection with a new timeline collector.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org