hadoop-yarn-dev mailing list archives

From "qiubingxue (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-5814) Add druid as storage backend in YARN Timeline Service
Date Wed, 02 Nov 2016 02:16:58 GMT
qiubingxue created YARN-5814:
--------------------------------

             Summary:  Add druid as storage backend in YARN Timeline Service
                 Key: YARN-5814
                 URL: https://issues.apache.org/jira/browse/YARN-5814
             Project: Hadoop YARN
          Issue Type: New Feature
          Components: ATSv2
    Affects Versions: 3.0.0-alpha2
            Reporter: qiubingxue


h3. Introduction

I propose adding Druid as a storage backend in YARN Timeline Service.

We run more than 6000 applications and generate 450 million metrics daily in Alibaba clusters
with thousands of nodes. We need to collect and store meta/event/metric data, analyze
utilization reports online across various dimensions, and display trends in resource
allocation and usage for the cluster by joining and aggregating data. Tracking resource
utilization in this way helps us manage and optimize the cluster.

To achieve this, we switched from HBase to Druid as the storage and have had sub-second
OLAP performance in our production environment for a few months.

h3. Analysis

Currently YARN Timeline Service only supports aggregating metrics a) at the flow level, by
FlowRunCoprocessor, and b) at the application level, by AppLevelTimelineCollector. Offline
(time-based periodic) aggregation for flows/users/queues for reporting and analysis is planned
but not yet implemented. YARN Timeline Service uses Apache HBase as its primary storage
backend, and HBase is not well suited to OLAP workloads.

For arbitrary exploration of data, such as online analysis of utilization reports across various
dimensions (Queue, Flow, User, Application, CPU, Memory) by joining and aggregating data, Druid's
custom column format enables ad-hoc queries without pre-computation. The format also enables
fast scans on columns, which is important for good aggregation performance.
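
As a rough illustration of such an ad-hoc query, the sketch below builds a Druid native
groupBy query that sums CPU and memory per queue and user over one day. The data source
name (yarn_timeline_metrics), the dimension names (queue, user) and the metric names
(cpu, memory) are assumptions made for this example only; they are not taken from our
deployment or from any existing Timeline Service schema.

```python
import json


def build_groupby_query(data_source, dimensions, metrics, interval, granularity="hour"):
    """Build a Druid native groupBy query as a plain dict.

    All names passed in are caller-supplied; nothing here is tied to a
    real Timeline Service schema.
    """
    return {
        "queryType": "groupBy",
        "dataSource": data_source,
        "granularity": granularity,
        "dimensions": list(dimensions),
        # One doubleSum aggregator per metric column.
        "aggregations": [
            {"type": "doubleSum", "name": f"total_{m}", "fieldName": m}
            for m in metrics
        ],
        # ISO-8601 interval, start/end.
        "intervals": [interval],
    }


query = build_groupby_query(
    data_source="yarn_timeline_metrics",  # hypothetical data source name
    dimensions=["queue", "user"],         # hypothetical dimension columns
    metrics=["cpu", "memory"],            # hypothetical metric columns
    interval="2016-11-01/2016-11-02",
)

# The JSON body that would be POSTed to a Druid broker.
print(json.dumps(query, indent=2))
```

Changing the grouping (for example to flow or application) only changes the dimensions
list; no pre-aggregated table has to exist for the new combination.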

To support online analysis of utilization reports across various dimensions, display of
trends in resource allocation and usage for the cluster, and arbitrary exploration of data,
we propose to add Druid storage and implement DruidWriter/DruidReader in YARN Timeline
Service.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-dev-help@hadoop.apache.org

