Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: yarn-issues@hadoop.apache.org
Date: Thu, 23 Apr 2015 04:44:39 +0000 (UTC)
From: "Vrushali C (JIRA)" <jira@apache.org>
To: yarn-issues@hadoop.apache.org
Message-ID: <JIRA.12786306.1427484032000.2408.1429764279540@Atlassian.JIRA>
In-Reply-To: <JIRA.12786306.1427484032000@Atlassian.JIRA>
References: <JIRA.12786306.1427484032000@Atlassian.JIRA>
 <JIRA.12786306.1427484032066@arcas>
Subject: [jira] [Updated] (YARN-3411) [Storage implementation] explore the
 native HBase write schema for storage
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/YARN-3411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vrushali C updated YARN-3411:
-----------------------------
    Attachment: YARN-3411.poc.2.txt

Attaching a patch that includes:
- an HBaseTimelineWriterImpl class
- a test class for it
- an EntityTableDetails class that holds entity-table-specific constants and helper functions
- a TimelineWriterUtils class with utility functions for reading from and writing to HBase tables

The write function in the HBaseTimelineWriterImpl class writes out the entire contents of a TimelineEntity object, including its info, config, metrics (timeseries), isRelatedTo and relatesTo fields.

The metrics timeseries is written so that the HBase cell timestamp is the metric timestamp, the column qualifier is the metric name, and the cell value is the metric value. I also propose changing the TimelineMetric values from "Object" to "long" (this patch does not make that change).
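To make the cell layout concrete, here is a rough plain-Java sketch of the mapping described above. It is only a model and makes no HBase client calls; the class MetricCellSketch, its Cell type, and the metric name used in the usage note are hypothetical and are not taken from the patch. It assumes metric values are already "long", as proposed, and encodes them as 8 big-endian bytes.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Plain-Java model of the proposed metric cell layout (hypothetical, not from the patch). */
class MetricCellSketch {

    /** One cell: column qualifier, cell timestamp, and encoded value. */
    static final class Cell {
        final String qualifier;
        final long timestamp;
        final byte[] value;

        Cell(String qualifier, long timestamp, byte[] value) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /** Encodes a long as 8 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] encode(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    /** Decodes the 8-byte value written by encode(). */
    static long decode(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    /**
     * Turns one metric's timeseries (sample time -> value) into cells:
     * qualifier = metric name, cell timestamp = sample time, value = encoded long.
     */
    static List<Cell> toCells(String metricName, Map<Long, Long> timeseries) {
        List<Cell> cells = new ArrayList<>();
        for (Map.Entry<Long, Long> e : timeseries.entrySet()) {
            cells.add(new Cell(metricName, e.getKey(), encode(e.getValue())));
        }
        return cells;
    }
}
```

For example, a metric with two samples (say a hypothetical "MAP_SLOT_MILLIS") yields two cells that share one qualifier and differ only in timestamp and value, which is what lets a single column hold the whole timeseries as versions.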

For the metrics column family, we should set a TTL of X days and MIN_VERSIONS = 1. That way, HBase retains the timeseries for X days and always keeps the latest value, even after the TTL has passed.
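The intended retention behaviour can be sketched as a small plain-Java simulation. This is a simplified model of how TTL and MIN_VERSIONS interact for a single column, not HBase code; the class and method names are hypothetical, and the exact expiry boundary in HBase may differ. The constant uses 30 days, which matches the TTL value of 2592000 seconds in the schema in this comment.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Simplified model of TTL + MIN_VERSIONS retention for one column (hypothetical, not HBase code). */
class RetentionSketch {

    /** 30 days expressed in seconds: 30 * 24 * 60 * 60 = 2592000. */
    static final long TTL_SECONDS = 30L * 24 * 60 * 60;

    /**
     * Returns the cell timestamps (in ms) that survive, newest first:
     * every cell younger than the TTL, plus as many of the newest expired
     * cells as are needed to keep at least minVersions cells.
     */
    static List<Long> retained(List<Long> timestampsMs, long nowMs,
                               long ttlSeconds, int minVersions) {
        List<Long> newestFirst = new ArrayList<>(timestampsMs);
        newestFirst.sort(Comparator.reverseOrder());

        List<Long> kept = new ArrayList<>();
        for (long ts : newestFirst) {
            boolean expired = nowMs - ts > ttlSeconds * 1000L;
            // Unexpired cells are always kept; expired ones only fill the minimum.
            if (!expired || kept.size() < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

With MIN_VERSIONS = 1, a metric that stopped reporting more than 30 days ago still keeps its single most recent sample, while a metric that is still reporting keeps everything from the last 30 days.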

The test class spins up a mini cluster via HBaseTestingUtility's startMiniCluster. It creates one entity object with info, config, metrics (timeseries), isRelatedTo and relatesTo entities and writes it to the backend by invoking the write API of the HBaseTimelineWriterImpl class. The test then scans the entity table, reads back the entity details and verifies the value of each field, including the timeseries.

Also attaching the Eclipse console log from a run of the unit test.

The schema creation would be along the lines of this:
{code}
create 'ats.entity',
  {NAME => 'i', COMPRESSION => 'LZO', BLOOMFILTER => 'ROWCOL'},
  {NAME => 'm', VERSIONS => 2147483647, MIN_VERSIONS => 1, COMPRESSION => 'LZO', BLOCKCACHE => false, TTL => '2592000'},
  {NAME => 'c', COMPRESSION => 'LZO', BLOCKCACHE => false, BLOOMFILTER => 'ROWCOL'}
{code}

> [Storage implementation] explore the native HBase write schema for storage
> --------------------------------------------------------------------------
>
>                 Key: YARN-3411
>                 URL: https://issues.apache.org/jira/browse/YARN-3411
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: timelineserver
>            Reporter: Sangjin Lee
>            Assignee: Vrushali C
>            Priority: Critical
>         Attachments: ATSv2BackendHBaseSchemaproposal.pdf, YARN-3411.poc.2.txt, YARN-3411.poc.txt
>
>
> There is work that's in progress to implement the storage based on a Phoenix schema (YARN-3134).
> In parallel, we would like to explore an implementation based on a native HBase schema for the write path. Such a schema does not exclude using Phoenix, especially for reads and offline queries.
> Once we have basic implementations of both options, we could evaluate them in terms of performance, scalability, usability, etc. and make a call.


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)