hadoop-yarn-issues mailing list archives

From "Joep Rottinghuis (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3051) [Storage abstraction] Create backing storage read interface for ATS readers
Date Fri, 19 Jun 2015 19:23:03 GMT

    [ https://issues.apache.org/jira/browse/YARN-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593826#comment-14593826 ]

Joep Rottinghuis commented on YARN-3051:

Not all arguments are equally selective. For example, relatesTo (entities) are not stored
in individual cells that can be used as a push-down predicate for the HBase tables. We'd have
to select all entities that match the other criteria, select the relatesTo string, parse it
into individual fields, and do set operations on them.
  Set<TimelineEntity> getEntities(String userId, String clusterId, String flowId,
      String flowRunId, String appId, String entityType, Long limit,
      Long createdTimeBegin, Long createdTimeEnd, Long modifiedTimeBegin,
      Long modifiedTimeEnd, Set<TimelineEntity.Identifier> relatesTo,
      Set<TimelineEntity.Identifier> isRelatedTo, Set<KeyValuePair> info,
      Set<KeyValuePair> configs, Set<String> events, Set<String> metrics,
      EnumSet<Field> fieldsToRetrieve) throws IOException;

If we defer being able to effectively select a subset of columns, what does it actually mean
to specify a Set<KeyValuePair>?
Can the value be null to indicate that we don't care what the value is and simply want the
column back in the result?

I think we should separate predicates (give me all X where Y=Z) from selectors (give
me all X...).
It is not clear from the latest patch whether fully populated entities will be returned.
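To make that distinction concrete, here is a hypothetical sketch (the class and field names are invented for illustration, not taken from any patch): a predicate narrows which entities match, while a selector only controls which fields of a matching entity are returned.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a predicate filters which entities match,
// while a selector only chooses which fields of a match to return.
final class InfoPredicate {
    final String key;
    final Object expectedValue;

    InfoPredicate(String key, Object expectedValue) {
        this.key = key;
        this.expectedValue = expectedValue;
    }

    // The entity matches only if info[key] equals the expected value.
    boolean matches(Map<String, Object> info) {
        return expectedValue.equals(info.get(key));
    }
}

final class FieldSelector {
    // Never narrows the match set, only trims the returned payload.
    final Set<String> keysToReturn;

    FieldSelector(Set<String> keysToReturn) {
        this.keysToReturn = keysToReturn;
    }
}
```

Keeping the two as distinct arguments would also resolve the null-value question above: a selector never needs a value at all.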

Makes sense. We could use a regex, or club different configs into different groups and let
the user query that group. But then the problem will be how we specify those groups. So, as
you say, let's defer it and discuss it at length when we take it up.
One thing though: along the lines of the patch submitted earlier, I can include something like
Map<String, NameValueRelations> for metrics in the interface for specifying relational
operations. It will support things like metricA>val1 and metricA<val2 as well (i.e., two
conditions on the same metric to specify a range). Thoughts?
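One possible shape for that argument, sketched below. NameValueRelations is the name from the earlier patch, but the fields and matches() method here are purely illustrative. A range like metricA>val1 AND metricA<val2 is simply two conditions registered for the same metric:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of NameValueRelations: several relational
// conditions on one metric together express a range (e.g. >10 AND <100).
final class NameValueRelations {
    enum Op { GT, GE, LT, LE, EQ, NE }

    private static final class Condition {
        final Op op;
        final long value;
        Condition(Op op, long value) { this.op = op; this.value = value; }
    }

    private final List<Condition> conditions = new ArrayList<>();

    NameValueRelations add(Op op, long value) {
        conditions.add(new Condition(op, value));
        return this;
    }

    // A metric value matches only if every condition holds (AND semantics).
    boolean matches(long metricValue) {
        for (Condition c : conditions) {
            boolean ok;
            switch (c.op) {
                case GT: ok = metricValue >  c.value; break;
                case GE: ok = metricValue >= c.value; break;
                case LT: ok = metricValue <  c.value; break;
                case LE: ok = metricValue <= c.value; break;
                case EQ: ok = metricValue == c.value; break;
                default: ok = metricValue != c.value; break;
            }
            if (!ok) return false;
        }
        return true;
    }
}
```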

Before we invent our own way to specify which columns (metrics, configs, etc.) we'll retrieve,
let's make sure that what we come up with can be mapped efficiently to our backing store.
As we've selected HBase as the major implementation to handle queries at scale, that means
that we need to think how to make effective use of filters (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterBase.html)
to aggressively reduce what we pull back from HBase. ColumnPrefixFilter, for example, will be
a good way to express which config columns to retrieve. A regex will be a poor way, as that
will result in having to pull back every column and then dropping values from the retrieved
result on the client side.
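For instance, a prefix-based config selection can map directly onto a server-side filter. In the sketch below the column family "c" and the "yarn." prefix are assumptions for illustration, not the actual table schema:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: push the column selection into HBase rather than regex-matching
// on the client. Family "c" and prefix "yarn." are made up for illustration.
final class ConfigPrefixScan {
    static Scan build() {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("c"));
        // Server side: only columns whose qualifier starts with "yarn."
        // are ever shipped back to the reader.
        scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("yarn.")));
        return scan;
    }
}
```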

Similarly, if our rowkeys are prefixed by users, then creating an API that doesn't include
the user (only the cluster) means that we're doing a full table scan, albeit with skip filters
that let us skip over users that we're not interested in.
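A sketch of the difference, assuming a rowkey layout of user!cluster!... (the "!" separator is an assumption): once the user is part of the key prefix, the read becomes a bounded range scan over that user's rows instead of a full table scan:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch, assuming rowkeys shaped user!cluster!... ("!" is an assumed
// separator): with the user known, the scan is bounded to one key range.
final class UserScopedScan {
    static Scan build(String user, String cluster) {
        byte[] prefix = Bytes.toBytes(user + "!" + cluster + "!");
        Scan scan = new Scan();
        scan.setStartRow(prefix);
        // Stop just past the last possible key carrying this prefix.
        scan.setStopRow(Bytes.add(prefix, new byte[] { (byte) 0xff }));
        return scan;
    }
}
```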

In an earlier patch I saw NameValueRelation that was able to perform the operations. That
again assumes that all values will be retrieved from the backing store and then filtered
in the reader before being returned to the user. It will be more effective to make sure we can
easily map this to operations we can push into HBase itself (through a SingleColumnValueFilter)
through the available operations (https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html).
I'm certainly not arguing to have these HBase-specific classes exposed in our API, but our
methods should closely match what can be done, which I don't think will be overly restrictive
or unreasonable.
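As a sketch of that mapping, a range condition such as metricA > 10 AND metricA < 100 can be pushed down as two SingleColumnValueFilters ANDed together in a FilterList. The family "m" and qualifier "metricA" below are illustrative, not the real table schema:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: metricA > 10 AND metricA < 100 evaluated inside HBase, so only
// matching rows come back. Family "m" and qualifier "metricA" are assumed.
final class MetricRangeScan {
    static Scan build() {
        byte[] family = Bytes.toBytes("m");
        byte[] qualifier = Bytes.toBytes("metricA");
        FilterList range = new FilterList(FilterList.Operator.MUST_PASS_ALL,
            new SingleColumnValueFilter(family, qualifier,
                CompareOp.GREATER, Bytes.toBytes(10L)),
            new SingleColumnValueFilter(family, qualifier,
                CompareOp.LESS, Bytes.toBytes(100L)));
        Scan scan = new Scan();
        scan.setFilter(range);
        return scan;
    }
}
```

Each relational operator in a reader-level NameValueRelations would then translate one-for-one to a CompareOp, which is why keeping the API close to that operation set matters.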

If we're going to have two types of tables in the backing store:
a) HBase native tables, specifically structured for efficient storage and retrieval
b) Phoenix tables (mainly time based aggregates and aggregates over non-primary key prefixes),
specifically structured for flexible querying
would it make sense to break these two queries into separate families?
Or are we thinking that, based on which arguments are passed in, we decide which tables to
query with which mechanism?

> [Storage abstraction] Create backing storage read interface for ATS readers
> ---------------------------------------------------------------------------
>                 Key: YARN-3051
>                 URL: https://issues.apache.org/jira/browse/YARN-3051
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>    Affects Versions: YARN-2928
>            Reporter: Sangjin Lee
>            Assignee: Varun Saxena
>         Attachments: YARN-3051-YARN-2928.003.patch, YARN-3051-YARN-2928.03.patch, YARN-3051-YARN-2928.04.patch,
> YARN-3051.Reader_API.patch, YARN-3051.Reader_API_1.patch, YARN-3051.wip.02.YARN-2928.patch,
> YARN-3051.wip.patch, YARN-3051_temp.patch
> Per design in YARN-2928, create backing storage read interface that can be implemented
> by multiple backing storage implementations.

This message was sent by Atlassian JIRA
