chukwa-dev mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] Commented: (CHUKWA-444) Redefine Chukwa time series storage
Date Fri, 26 Feb 2010 18:13:28 GMT

    [ https://issues.apache.org/jira/browse/CHUKWA-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838970#action_12838970 ]

Eric Yang commented on CHUKWA-444:
----------------------------------

More refined plan:

Type 1 Data:

Add a post-demux data loader that waits for new ChukwaRecords files and merges them with
the existing ChukwaRecords files through a second MR job.  The second MR job also produces
a low-resolution version of the data for reports.

/chukwa/repos/TYPE/DATE <-- Original data goes here.
/chukwa/report/TYPE/[yearly,monthly,weekly,daily] <-- Summarized JSON data goes here.
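
As a rough illustration of the layout above, the following Python sketch builds the two paths and merges newly arrived records into an existing time-sorted list. The function names, the (timestamp, source, value) record tuple, and the form of the DATE string are assumptions made for this example only; the actual second MR job would operate on ChukwaRecords sequence files in HDFS, not on in-memory lists.

```python
import heapq

REPOS_ROOT = "/chukwa/repos"
REPORT_ROOT = "/chukwa/report"
RESOLUTIONS = ("yearly", "monthly", "weekly", "daily")


def repos_path(data_type, date):
    """Path for original data: /chukwa/repos/TYPE/DATE."""
    return f"{REPOS_ROOT}/{data_type}/{date}"


def report_path(data_type, resolution):
    """Path for summarized JSON: /chukwa/report/TYPE/<resolution>."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution}")
    return f"{REPORT_ROOT}/{data_type}/{resolution}"


def merge_records(existing, incoming):
    """Merge new records into an existing time-sorted list.

    Records are hypothetical (timestamp, source, value) tuples.
    `existing` must already be sorted by timestamp; `incoming`
    is sorted here before the merge.
    """
    by_time = lambda record: record[0]
    return list(heapq.merge(existing, sorted(incoming, key=by_time), key=by_time))
```

Merging two sorted inputs this way keeps records from multiple sources in a single time-ordered sequence, which is the property the merged ChukwaRecords files are meant to preserve.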

The report JSON will be fixed to 300 data points per series, optimized for graphing.
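
One possible way to reduce a series to 300 points is to split it into 300 consecutive buckets and average each bucket. This is only a sketch: the choice of averaging, the function name, and the (timestamp, value) pair format are assumptions, and a series that has fewer than 300 samples is returned unchanged rather than padded.

```python
def downsample(points, target=300):
    """Reduce a time-sorted list of (timestamp, value) pairs to at most
    `target` points by averaging each bucket of consecutive samples."""
    n = len(points)
    if n <= target:
        return list(points)
    result = []
    for i in range(target):
        # Integer arithmetic spreads the n samples evenly over the buckets;
        # every bucket holds at least one sample because n > target.
        start = i * n // target
        end = (i + 1) * n // target
        bucket = points[start:end]
        timestamp = bucket[0][0]  # label the bucket with its first timestamp
        average = sum(value for _, value in bucket) / len(bucket)
        result.append((timestamp, average))
    return result
```

For metrics where spikes matter, taking the maximum of each bucket instead of the mean may be a better fit for graphing.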

Type 2 data for plain text searching:

After data has been archived, use a full-body indexer such as Lucene to build searchable indexes.
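
To show the idea behind such an index, here is a toy inverted index in Python. It is a stand-in for illustration and does not use or resemble the Lucene API; the function names and the mapping of document ids to text are assumptions.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

Building the index once after archiving means a search for a particular event only touches the index, not the archived files themselves.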

The architecture looks like this:

{noformat}
Adaptor -> Agent -> Collector |-> Archive -> Full Body Index |-> Retention
                              +-> Demux   -> Aggregation     |-> Retention
                                                             +-> Hicc
{noformat}                                             

> Redefine Chukwa time series storage
> -----------------------------------
>
>                 Key: CHUKWA-444
>                 URL: https://issues.apache.org/jira/browse/CHUKWA-444
>             Project: Hadoop Chukwa
>          Issue Type: New Feature
>          Components: Data Processors
>         Environment: Redhat EL 5.1, Java 6
>            Reporter: Eric Yang
>
> The current Chukwa Record format is not suitable for data visualization.  It is more like
> an archive format that combines data from multiple sources (hosts) and groups them into
> a sorted, time-partitioned sequence file.  Most people collect data for two reasons: archiving
> and data analysis.  The current Chukwa Record format is fine for archiving, but it is not well
> suited to data analysis.  Data analysis can be further broken down into two types:
> 1) data that can be aggregated and summarized, such as metrics, and 2) data that cannot be
> summarized, such as job history.  Type 1 data is useful for visualization by graph, and type 2
> data is useful for plain text viewing or for searching for a particular event.
> By the above rationale, it probably makes sense to restructure Chukwa Records for data
> analysis.  Outside of the Hadoop world, RRDtool is great for time series data storage and is
> optimized for metrics from a single source, i.e. a host.  RRD data files fragment badly when
> there are hundreds of thousands of sources.  Chukwa time series data storage should be able
> to combine multiple data sources into one Chukwa file to combat the file fragmentation problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

