hadoop-yarn-issues mailing list archives

From "Li Lu (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-2673) Add retry for timeline client
Date Wed, 15 Oct 2014 02:55:33 GMT

     [ https://issues.apache.org/jira/browse/YARN-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Li Lu updated YARN-2673:
    Attachment: YARN-2673-101414.patch

Uploaded a patch for this issue. By default, TimelineClient will retry for a given amount of
time before throwing an exception when posting to the server. A few notes:

1. Retrying vs. discarding timeline data: Without this retry, the timeline client drops the
posted data if the first attempt fails. I had an offline discussion with [~vinodkv].
We agreed that blocking the timeline client for a short while is better, since we may not
want to drop critical timeline data.

2. Retry behavior configuration: Users can define the maximum retry count and the time interval
between consecutive retries. We may want two levels of retry settings: a cluster-wide
setting, managed through yarn-site.xml, and a per-application custom setting. For the cluster
setting, I've added two configuration properties, yarn.timeline-service.client.max-retries
(default 30) and yarn.timeline-service.client.retry-interval-ms (default 1000). I've also
provided a customizeRetrySettings method for application-specific retry settings.

3. Retry implementation: The timeline client does not use RPC; it uses RESTful APIs. I'm
implementing retry as a Jersey filter in this patch.

4. Tests: I added two new unit tests, one to test the customizeRetrySettings API and the other
to test whether the retry actually happens when we try to post timeline entities.
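
For readers who want a concrete picture of the retry behavior described in notes 1 and 2, here
is a minimal sketch in plain Java. It is not the code in the attached patch, which implements
retry as a Jersey filter. The class TimelineRetrySketch and the method callWithRetries are
hypothetical names, java.util.Properties stands in for the real YARN configuration, and treating
max-retries as "additional attempts after the first one" is an assumption. Only the two property
names and their defaults come from the notes above.

```java
import java.util.Properties;
import java.util.concurrent.Callable;

// Hypothetical illustration only; not the class or filter added by the patch.
final class TimelineRetrySketch {
    // Property names and defaults as stated in the notes above.
    static final String MAX_RETRIES_KEY = "yarn.timeline-service.client.max-retries";
    static final String RETRY_INTERVAL_KEY = "yarn.timeline-service.client.retry-interval-ms";
    static final int DEFAULT_MAX_RETRIES = 30;
    static final long DEFAULT_RETRY_INTERVAL_MS = 1000L;

    // Runs op once, then retries up to maxRetries more times (an assumed meaning
    // of "max retries"), sleeping retryIntervalMs between consecutive attempts.
    static <T> T callWithRetries(int maxRetries, long retryIntervalMs, Callable<T> op) {
        if (maxRetries < 0 || retryIntervalMs < 0) {
            throw new IllegalArgumentException("retry settings must be non-negative");
        }
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e; // remember the failure and fall through to the wait
            }
            if (attempt < maxRetries) {
                try {
                    Thread.sleep(retryIntervalMs); // blocks the caller, as in note 1
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting to retry", ie);
                }
            }
        }
        // All attempts failed: surface the last error instead of silently dropping data.
        throw new IllegalStateException(
            "Giving up after " + (maxRetries + 1) + " attempts", last);
    }

    // Reads the cluster-level settings, falling back to the documented defaults.
    static <T> T callWithRetries(Properties conf, Callable<T> op) {
        int maxRetries = Integer.parseInt(
            conf.getProperty(MAX_RETRIES_KEY, String.valueOf(DEFAULT_MAX_RETRIES)));
        long intervalMs = Long.parseLong(
            conf.getProperty(RETRY_INTERVAL_KEY, String.valueOf(DEFAULT_RETRY_INTERVAL_MS)));
        return callWithRetries(maxRetries, intervalMs, op);
    }
}
```

A caller would wrap the post in a Callable, for example
TimelineRetrySketch.callWithRetries(conf, () -> doPost(entity)), where doPost is a placeholder
for whatever performs the REST call. With the default settings, and ignoring the time each
failed attempt itself takes, the caller would block for roughly 30 seconds before the exception
is thrown.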

> Add retry for timeline client
> -----------------------------
>                 Key: YARN-2673
>                 URL: https://issues.apache.org/jira/browse/YARN-2673
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Li Lu
>            Assignee: Li Lu
>         Attachments: YARN-2673-101414.patch
> The timeline client currently does not handle a server outage gracefully. Jobs from
> distributed shell may fail due to an ATS restart. We may need to add a retry mechanism
> to the client.

This message was sent by Atlassian JIRA
