chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] Updated: (CHUKWA-567) Create a generic down sampling framework for time series metrics in hbase
Date Sat, 18 Dec 2010 08:00:05 GMT

     [ https://issues.apache.org/jira/browse/CHUKWA-567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Yang updated CHUKWA-567:
-----------------------------

    Description: 
Large time series data can be down sampled in a generic way.  This jira is to create a general
down sampling framework which can be scheduled in the background.  In theory, a configuration
file can specify the source table names, the down sampled table name suffixes and the intervals
at which to down sample.

For example:

{noformat}
chukwa.data.sample.tables=SystemMetrics,Hadoop
chukwa.down.sample.suffix=_monthly,_yearly
chukwa.down.sample.frequency=30,360
{noformat}

With this configuration, the down sampling framework will trigger a down sampling job every 30
and 360 minutes respectively for each of the SystemMetrics and Hadoop tables.  The down sampled
data are stored in SystemMetrics_monthly and SystemMetrics_yearly, and in Hadoop_monthly and
Hadoop_yearly, respectively.
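
To make the expansion concrete, here is a minimal Python sketch of how the three properties
could be turned into one job per table and suffix.  The function name and the job dictionary
keys are hypothetical, and pairing each suffix with the frequency at the same position is an
assumption drawn from the example above, not a confirmed design.

```python
def expand_down_sample_jobs(props):
    """Expand the comma-separated properties into one job per (table, suffix) pair.

    Assumption: the Nth suffix is paired with the Nth frequency (in minutes).
    """
    tables = [t.strip() for t in props["chukwa.data.sample.tables"].split(",")]
    suffixes = [s.strip() for s in props["chukwa.down.sample.suffix"].split(",")]
    minutes = [int(m) for m in props["chukwa.down.sample.frequency"].split(",")]
    if len(suffixes) != len(minutes):
        raise ValueError("each suffix needs exactly one frequency")

    jobs = []
    for table in tables:
        for suffix, interval in zip(suffixes, minutes):
            jobs.append({
                "source": table,            # table to read from
                "target": table + suffix,   # table to write down sampled rows to
                "interval_minutes": interval,
            })
    return jobs


props = {
    "chukwa.data.sample.tables": "SystemMetrics,Hadoop",
    "chukwa.down.sample.suffix": "_monthly,_yearly",
    "chukwa.down.sample.frequency": "30,360",
}
jobs = expand_down_sample_jobs(props)
for job in jobs:
    print(job["source"], "->", job["target"], "every", job["interval_minutes"], "minutes")
```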

The down sampling framework will automatically create a pig script with the time and config
parameters filled in and trigger the script to run.  If there are columns that cannot be down
sampled (non-numeric values), the first value will be used.  The down sampling framework will
use time and row key for grouping.
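
The grouping rule can be illustrated outside of pig with a small Python sketch.  It groups
rows by row key and time bucket, keeps the first value for non-numeric columns as described
above, and averages numeric columns.  The averaging is an assumption (the aggregation function
is not specified in this issue), and the row layout and function names are hypothetical.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def down_sample(rows, interval_minutes):
    """Group rows by (row key, time bucket) and reduce each column to one value."""
    bucket_seconds = interval_minutes * 60
    groups = {}
    for row in rows:
        # Round the timestamp (epoch seconds) down to the start of its bucket.
        bucket = row["timestamp"] - row["timestamp"] % bucket_seconds
        groups.setdefault((row["row_key"], bucket), []).append(row["columns"])

    result = []
    for (row_key, bucket), members in groups.items():
        values_by_column = {}
        for columns in members:
            for name, value in columns.items():
                values_by_column.setdefault(name, []).append(value)

        sampled = {}
        for name, values in values_by_column.items():
            if all(_is_number(v) for v in values):
                sampled[name] = sum(values) / len(values)  # assumed: mean
            else:
                sampled[name] = values[0]  # non-numeric: first value wins
        result.append({"row_key": row_key, "timestamp": bucket, "columns": sampled})
    return result


rows = [
    {"row_key": "host1", "timestamp": 0,    "columns": {"cpu": 10, "os": "linux"}},
    {"row_key": "host1", "timestamp": 600,  "columns": {"cpu": 30, "os": "linux"}},
    {"row_key": "host1", "timestamp": 1800, "columns": {"cpu": 50, "os": "linux"}},
]
sampled = down_sample(rows, interval_minutes=30)
for row in sampled:
    print(row)
```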

Oozie can be used as the job scheduler for the down sampling framework, hence I only need to
write the pig script and an Oozie workflow to plug in the parameters.  Suggestions and
recommendations are welcome.
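
As a rough illustration of plugging in the parameters, the Python sketch below builds the
argument list for running a pig script from the command line with pig's -param and -f options.
It stands in for the Oozie workflow rather than showing one.  The script name downsample.pig,
the parameter names and the time range values are all hypothetical.

```python
def build_pig_command(script, source, target, start_ms, end_ms):
    """Build the argument list for one down sampling run.

    Parameter names (SOURCE, TARGET, START, END) are illustrative only;
    the pig script would have to reference them as $SOURCE, $TARGET, etc.
    """
    params = {
        "SOURCE": source,
        "TARGET": target,
        "START": str(start_ms),
        "END": str(end_ms),
    }
    command = ["pig"]
    for name, value in params.items():
        command += ["-param", "{}={}".format(name, value)]
    command += ["-f", script]
    return command


command = build_pig_command(
    "downsample.pig", "SystemMetrics", "SystemMetrics_monthly", 0, 1800000
)
print(" ".join(command))
```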

  was:
Large time series data can be down sampled in a generic way.  This jira is to create a general
down sampling framework which can be schedule in the background.  In theory, a configuration
file can specify which source table name, down sampled table name suffix and the interval
to down sample.

For example:

{noformat}
chukwa.data.sample.tables=SystemMetrics,Hadoop
chukwa.down.sample.suffix=_monthly,_yearly
chukwa.down.sample.frequency=30,90
{noformat}

By this configuration, down sample framework will trigger down sampling job every 30 and 360
minutes respectively for each of the SystemMetrics and Hadoop table.  The down sampled data
are stored into SystemMetrics_monthly, SystemMetrics_yearly and Hadoop_monthly, and Hadoop_yearly
respectively.

The down sampling framework will automatically create pig script with time and config parameters
filled in and trigger the script to run, and if there are columns that can not be down sampled
(non-numeric value), the first value will be used.  The down sampling framework will use time
and row key for grouping.

Oozie can be used as job scheduler for the down sampling framework, hence I only need to write
the pig script and Oozie workflow to plugin parameters.  Suggestion and recommendation are
welcome.


> Create a generic down sampling framework for time series metrics in hbase
> -------------------------------------------------------------------------
>
>                 Key: CHUKWA-567
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-567
>             Project: Chukwa
>          Issue Type: New Feature
>         Environment: Java 6, Mac OSX 10.6
>            Reporter: Eric Yang
>            Assignee: Eric Yang

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

