chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <>
Subject [jira] [Commented] (CHUKWA-667) Optimize the HBase schema for Ganglia queris
Date Sun, 11 Jan 2015 22:46:34 GMT


Eric Yang commented on CHUKWA-667:

Resuming progress on this issue.  We have learned a few lessons about metrics schema design in the
last couple of years.  Monotonically increasing row keys are bad for HBase region servers.  We have
built an alternate HBase schema, like this:

[group name].[date].[metric]:[primary key], with column family "m" and cell name "m".  This
allows regions to be split by prefix, but it isn't great either.  The advantage is having
one table for all types of metrics.  We observed that some groups may have metrics growing faster
than other groups; therefore, region splits still need a lot of manual maintenance to prevent
HBase from blowing up.
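
As a rough illustration of the row key layout described above, here is a minimal Python sketch.
The function name, the YYYYMMDD date format, the use of UTC, and the host name used as the
primary key are all assumptions; the comment does not specify them.

```python
from datetime import datetime, timezone


def legacy_row_key(group, ts_seconds, metric, primary_key):
    """Build a key shaped like [group name].[date].[metric]:[primary key].

    Assumption: [date] is rendered as YYYYMMDD in UTC (format not given in the source).
    """
    date = datetime.fromtimestamp(ts_seconds, tz=timezone.utc).strftime("%Y%m%d")
    return f"{group}.{date}.{metric}:{primary_key}"


# Hypothetical sample: the host name is made up for illustration.
key = legacy_row_key("HDFS", 1234567890, "datanode_bytes_read", "host1")
print(key)  # HDFS.20090213.datanode_bytes_read:host1
```

Because every key starts with the group name, all rows of a fast-growing group sort next to
each other, which is consistent with the uneven region growth noted above.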

A new proposal is to change the metrics schema design to partition by day of the month.  More
often than not, a time series database has two requirements: fast lookup by time and fast lookup
for the same metric.  This means the row key needs to carry hints for partitioning by day and
partitioning by primary key.  An improved schema can be generated by:

Table: [group name]
Row Key: [day:primary_key]
Column Family: [subgroup name]
Column: [metric name]
Timestamp: [actual timestamp]

Example of a Hadoop table would look like:

Table: Hadoop
Row Key:
Column Family: HDFS
Column: datanode_bytes_read
Timestamp: 1234567890
Value: 123
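
The mapping from one metric sample to the proposed layout could be sketched as follows.  This is
only an illustration under stated assumptions: the two-digit, UTC day-of-month prefix, the
Cell tuple, the helper names, and the host name used as the primary key are hypothetical and not
taken from Chukwa.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Cell(NamedTuple):
    table: str       # [group name]
    row_key: str     # [day:primary_key]
    family: str      # [subgroup name]
    qualifier: str   # [metric name]
    timestamp: int   # actual timestamp of the sample
    value: int


def day_row_key(ts_seconds, primary_key):
    # Assumption: "day" is the UTC day of the month, zero-padded to two digits
    # so that keys sort consistently as strings.
    day = datetime.fromtimestamp(ts_seconds, tz=timezone.utc).day
    return f"{day:02d}:{primary_key}"


def to_cell(group, subgroup, metric, primary_key, ts_seconds, value):
    # Place one sample into the proposed table / row key / family / column layout.
    return Cell(group, day_row_key(ts_seconds, primary_key),
                subgroup, metric, ts_seconds, value)


# Hypothetical primary key (a host name); the other values come from the example above.
cell = to_cell("Hadoop", "HDFS", "datanode_bytes_read", "host1.example.com", 1234567890, 123)
print(cell.row_key)  # 13:host1.example.com
```

With this shape, samples for the same primary key on the same day of the month land in one row
and are told apart by the cell timestamp, and the day prefix gives a small, fixed set of
candidate split points.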

Units and metric types can be stored in a secondary table for rendering and metadata lookup
to reduce storage space.  Thoughts?
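
A minimal sketch of the secondary-table idea, with an in-memory dict standing in for the
metadata table.  The key shape, the field names, and the unit and type values are hypothetical
placeholders, not part of the proposal.

```python
# Stand-in for a secondary metadata table: one entry per metric rather than per sample.
# All values below are illustrative placeholders.
METRIC_METADATA = {
    ("Hadoop", "HDFS", "datanode_bytes_read"): {"unit": "bytes", "type": "counter"},
}


def describe(group, subgroup, metric):
    # Look up rendering metadata once per metric; return an empty dict if unknown.
    return METRIC_METADATA.get((group, subgroup, metric), {})


print(describe("Hadoop", "HDFS", "datanode_bytes_read"))
```

Keeping units and types here means the data cells only need to carry the raw value.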

> Optimize the HBase schema for Ganglia queris
> --------------------------------------------
>                 Key: CHUKWA-667
>                 URL:
>             Project: Chukwa
>          Issue Type: Sub-task
>          Components: Data Processors
>    Affects Versions: 0.6.0
>            Reporter: Saisai Shao
> Chukwa HBase table schema is designed for HICC; it cannot be fully adapted to the Ganglia
> web frontend for several reasons:
> (1) It cannot quickly retrieve all the cluster and related host names.
> (2) System metrics have no attributes, like type or unit, so it is hard to interpret the
> collected metrics in code.
> (3) It lacks a data consolidation function; choosing a metric for a large time range (like 30
> days) will fetch all the data and draw the graph, which will greatly reduce performance.
> We will redesign the table schema so that it is better adapted to the Ganglia web frontend.
