hadoop-yarn-issues mailing list archives

From "Li Lu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2928) YARN Timeline Service: Next generation
Date Fri, 05 Jun 2015 22:35:05 GMT

    [ https://issues.apache.org/jira/browse/YARN-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575358#comment-14575358 ]

Li Lu commented on YARN-2928:

Hi [~jamestaylor]

Thank you very much for your suggestions and PHOENIX-2028! I wrote the experimental Phoenix
writer code and currently have some follow-up questions about your comments.

bq. The easiest is probably to create the HBase table the same way (through code or using
the HBase shell) with the KeyPrefixRegionSplitPolicy specified at create time. Then, in Phoenix
you can issue a CREATE TABLE statement against the existing HBase table and it'll just map
to it. Then you'll have your split policy for your benchmark in both write paths.

If I understand this correctly, Phoenix will inherit the pre-split settings from HBase in
this case? Will this alter the existing HBase table, including its schema and/or the data in it?
In general, if one runs CREATE TABLE IF NOT EXISTS or simply CREATE TABLE against an
existing pre-split HBase table, will Phoenix simply accept the existing table as-is?

bq. An alternative to dynamic columns is to define views over your Phoenix table (http://phoenix.apache.org/views.html).

I looked at views before, but I'm not sure they fit our write path use case well. Let me
briefly describe our use case in YARN first. In general, we would like to dynamically store
the configuration and metrics for each YARN timeline entity in a Phoenix database, so that
our timeline reader apps or users can use SQL to query historical data. Phoenix views may be
a perfect solution for the reader use cases. However, we are hitting problems on the writer
side. We store each configuration/metric key-value pair in a dynamic column. This causes
two main problems. First, we need to use a dynamically generated SQL statement to write to
the Phoenix table, which is cumbersome and error-prone. Second, when performing aggregations,
we need to aggregate over all available metrics for an application (or a user or flow), but we
cannot simply iterate over those dynamic columns because there is no such API. I'm not sure
how to resolve these two problems via Phoenix views or via existing Phoenix APIs. Actually,
I suspect that if it's possible to fall back to the HBase-style APIs, our writer path would
be much simpler.
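
To make the first problem concrete, here is a minimal sketch (in Python, for brevity) of the kind of statement generation a dynamic-column writer needs. It is an illustration only, not the actual writer code: the table name, the column names, and the helper names are hypothetical, and it assumes Phoenix's documented dynamic column form, where each dynamic column is declared with its type inside the UPSERT column list.

```python
# Sketch only: builds a parameterized Phoenix UPSERT with dynamic columns.
# Table/column names and helper names are hypothetical.

ALLOWED_TYPES = {"VARCHAR", "BIGINT", "DOUBLE"}  # assumed subset, for illustration


def quote_identifier(name):
    """Double-quote a dynamic column name so dots and case are preserved."""
    if '"' in name:
        raise ValueError("column name must not contain a double quote: %r" % name)
    return '"' + name + '"'


def build_dynamic_upsert(table, fixed_columns, dynamic_columns):
    """Return an UPSERT statement with one '?' placeholder per column.

    fixed_columns:   names of columns declared in the table schema.
    dynamic_columns: (name, sql_type) pairs declared at write time.
    """
    cols = list(fixed_columns)
    for name, sql_type in dynamic_columns:
        if sql_type not in ALLOWED_TYPES:
            raise ValueError("unsupported type: %r" % sql_type)
        cols.append("%s %s" % (quote_identifier(name), sql_type))
    placeholders = ", ".join("?" for _ in cols)
    return "UPSERT INTO %s (%s) VALUES (%s)" % (table, ", ".join(cols), placeholders)


if __name__ == "__main__":
    print(build_dynamic_upsert(
        "timeline_entity",
        ["cluster", "entity_id"],
        [("yarn.memory.mb", "BIGINT"), ("queue", "VARCHAR")],
    ))
```

Because the column list differs for every entity, each distinct set of metric keys yields a different statement, which is the source of the "cumbersome and error-prone" concern above.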

bq. If you do end up going with a direct HBase write path, I'd encourage you to use the Phoenix
serialization format (through PDataType and derived classes) to ensure you can do adhoc querying
on the data.

We're currently looking into this method for the aggregation part. We're doing our best to
support SQL on the aggregated data by using Phoenix. One potential solution is to use HBase
coprocessors to aggregate application data from the HBase storage and then store the results in
a Phoenix aggregation table. However, if we want to keep aggregating on the Phoenix table,
can we also write an HBase coprocessor that reads the Phoenix PDataTypes and aggregates them
into other Phoenix tables? If that's possible, are there any stable (or "safe") APIs for PDataTypes?
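
As a small illustration of why the serialization format matters on a direct HBase write path, the sketch below mimics (in Python) how a signed 64-bit value can be encoded so that its bytes sort in numeric order. This reflects my understanding of Phoenix's ascending BIGINT encoding (big-endian with the sign bit flipped); the function names are hypothetical, and real code should go through PDataType rather than reimplement this.

```python
import struct

# Sketch only: big-endian 8-byte encoding with the sign bit flipped, so that
# unsigned byte-wise comparison matches signed numeric order. Assumed to mirror
# Phoenix's ascending BIGINT encoding; function names are hypothetical.


def encode_bigint(value):
    raw = bytearray(struct.pack(">q", value))  # plain big-endian two's complement
    raw[0] ^= 0x80                             # flip the sign bit
    return bytes(raw)


def decode_bigint(data):
    raw = bytearray(data)
    raw[0] ^= 0x80                             # undo the sign-bit flip
    return struct.unpack(">q", bytes(raw))[0]


if __name__ == "__main__":
    values = [42, -1, 0, -5, 1]
    # Sorting by encoded bytes gives the same order as sorting numerically.
    print(sorted(values, key=encode_bigint) == sorted(values))
    # Plain two's complement would place negatives after positives instead.
    print(sorted(values, key=lambda v: struct.pack(">q", v)))
```

A writer that stores plain two's-complement bytes would produce cells that a Phoenix query decodes to different values, which is the ad hoc querying risk the quoted comment warns about.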

A slightly more general question: is SQL the _only_ API for Phoenix, or are there
others? I ask because, from a YARN timeline service perspective, Phoenix
is a nice tool through which we can easily add SQL support for our end users, but we may
not necessarily want to use SQL to program against it all the time.

Thank you very much for your comments and help from the Phoenix side. Our current Phoenix
writer is more of an experimental version, but we really hope to have something for our aggregators
and readers in the near future.

> YARN Timeline Service: Next generation
> --------------------------------------
>                 Key: YARN-2928
>                 URL: https://issues.apache.org/jira/browse/YARN-2928
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Sangjin Lee
>            Priority: Critical
>         Attachments: ATSv2.rev1.pdf, ATSv2.rev2.pdf, Data model proposal v1.pdf, Timeline Service Next Gen - Planning - ppt.pptx, TimelineServiceStoragePerformanceTestSummaryYARN-2928.pdf
> We have the application timeline server implemented in yarn per YARN-1530 and YARN-321. Although it is a great feature, we have recognized several critical issues and features that need to be addressed.
> This JIRA proposes the design and implementation changes to address those. This is phase 1 of this effort.
