hadoop-yarn-issues mailing list archives

From "Sangjin Lee (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4074) [timeline reader] implement support for querying for flows and flow runs
Date Thu, 17 Sep 2015 18:37:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-4074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14803384#comment-14803384 ]

Sangjin Lee commented on YARN-4074:

In TimelineEntityReader#readMetrics it seems safe to assume that if we have more than one
value, then this is a TimelineMetric.Type.TIME_SERIES.
The converse doesn't have to be true though, right? I guess we'll just assume that for timelines
we'd never have just one value? I can't quite foresee the impact of incorrectly assuming TimelineMetric.Type.SINGLE_VALUE
if only one value has been written to HBase so far.

That's right. We discussed this some time ago, and we think it'd be safer if the metric type
(single value vs. time series) were stored/persisted. But there are other dimensions of metrics
we may need to store (e.g. long vs. float, whether to aggregate, etc.). Also, there is the question
of what to do if users write inconsistent data. So, at that time we went with the simple approach
that's currently there (the code you see in {{TimelineEntityReader}} was refactored out of
{{HBaseTimelineReaderImpl}}, so it's not new code).

We should come to a conclusion on how to store/encode various dimensions of metrics, but not
as part of this JIRA.
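
To illustrate the ambiguity discussed above, here is a minimal Python sketch of the
count-based inference. It is not the code in TimelineEntityReader#readMetrics; the function
name infer_metric_type and the timestamp-to-value dict are assumptions made only for this
example.

```python
def infer_metric_type(values):
    """Guess the metric type from how many (timestamp -> value) cells were read.

    Hypothetical helper: it mirrors the rule described in this thread,
    not the actual Hadoop implementation.
    """
    return "TIME_SERIES" if len(values) > 1 else "SINGLE_VALUE"


# A time series that has only had its first point written so far is
# indistinguishable from a single-value metric under this rule.
first_write = {1442351767756: 10}
later_read = {1442351767756: 10, 1442351827756: 12}

print(infer_metric_type(first_write))  # SINGLE_VALUE
print(infer_metric_type(later_read))   # TIME_SERIES
```

Persisting the type alongside the values would remove the guess, which is the direction
suggested above.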

Wrt. ApplicationRowKey: at some point (perhaps not in this JIRA) we should consider making the
app_id a compound object that is stored with a ? separator. The prefix (which in most cases in YARN
right now is "application_") would be stored separately, and the RM start time and the final numeric
part would each be stored as a numerical value with a separate Bytes.to... conversion.

Otherwise we'll end up with incorrectly ordered rowkeys when the numeric part of the application id
reaches 10K, and again at each power of ten after that. For example, lexically application_1442351767756_10000
< application_1442351767756_9999

If we just access the application by specific key this doesn't matter, but if we do a row-scan
and count on ordering to set an appropriate stop on the scan, we'll break things.
This happens on all rowkeys that contain the app_id.

That's a good point. We need to fix this, or we'll have incorrect orders/results happening
with queries. This impacts anywhere we rely on the app id order (as string). I'll file a separate
JIRA to address this issue.
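
As a rough illustration of the ordering problem and of the compound-key idea, here is a
minimal Python sketch. It is not from the patch: the helper names, the SEPARATOR byte, and
the exact key layout are placeholders chosen for this example. The one fact it relies on is
that a fixed-width big-endian encoding of non-negative integers (which is what HBase's
Bytes.toBytes(long) produces) sorts byte-wise in numeric order.

```python
import struct

# Placeholder separator; the thread does not settle on an actual value.
SEPARATOR = b"\x00"


def parse_app_id(app_id):
    """Split 'application_<rmStartTime>_<sequence>' into its three parts."""
    prefix, rm_start_time, sequence = app_id.rsplit("_", 2)
    return prefix, int(rm_start_time), int(sequence)


def encode_app_id(app_id):
    """Hypothetical compound encoding: prefix, separator, two 8-byte integers."""
    prefix, rm_start_time, sequence = parse_app_id(app_id)
    # ">QQ" packs two unsigned 64-bit integers in big-endian order, so
    # byte-wise comparison of the result matches numeric comparison.
    return prefix.encode("utf-8") + SEPARATOR + struct.pack(">QQ", rm_start_time, sequence)


ids = ["application_1442351767756_9999", "application_1442351767756_10000"]

# Plain string order puts _10000 before _9999.
print(sorted(ids))
# Ordering by the encoded bytes puts _9999 before _10000.
print(sorted(ids, key=encode_app_id))
```

With an encoding along these lines, a row scan that relies on key order to pick its stop row
would see application ids in numeric order.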

> [timeline reader] implement support for querying for flows and flow runs
> ------------------------------------------------------------------------
>                 Key: YARN-4074
>                 URL: https://issues.apache.org/jira/browse/YARN-4074
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>         Attachments: YARN-4074-YARN-2928.007.patch, YARN-4074-YARN-2928.POC.001.patch, YARN-4074-YARN-2928.POC.002.patch, YARN-4074-YARN-2928.POC.003.patch, YARN-4074-YARN-2928.POC.004.patch, YARN-4074-YARN-2928.POC.005.patch, YARN-4074-YARN-2928.POC.006.patch
> Implement support for querying for flows and flow runs.
> We should be able to query for the most recent N flows, etc.
> This includes changes to the {{TimelineReader}} API if necessary, as well as implementation of the API.
