chukwa-commits mailing list archives

From ey...@apache.org
Subject svn commit: r1210325 - in /incubator/chukwa/trunk/src/site/apt: admin.apt dataflow.apt programming.apt
Date Mon, 05 Dec 2011 04:06:44 GMT
Author: eyang
Date: Mon Dec  5 04:06:44 2011
New Revision: 1210325

URL: http://svn.apache.org/viewvc?rev=1210325&view=rev
Log:
CHUKWA-612. Convert Chukwa document from forrest format to apt format. (Eric Yang)

Added:
    incubator/chukwa/trunk/src/site/apt/dataflow.apt
      - copied, changed from r1208953, incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml
    incubator/chukwa/trunk/src/site/apt/programming.apt
      - copied, changed from r1208953, incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
Modified:
    incubator/chukwa/trunk/src/site/apt/admin.apt

Modified: incubator/chukwa/trunk/src/site/apt/admin.apt
URL: http://svn.apache.org/viewvc/incubator/chukwa/trunk/src/site/apt/admin.apt?rev=1210325&r1=1210324&r2=1210325&view=diff
==============================================================================
--- incubator/chukwa/trunk/src/site/apt/admin.apt (original)
+++ incubator/chukwa/trunk/src/site/apt/admin.apt Mon Dec  5 04:06:44 2011
@@ -43,7 +43,7 @@ development.
   The only absolute software requirements are {{{http://java.sun.com}Java 1.6}}
 or better and {{{http://hadoop.apache.org/}Hadoop 0.20.205.1+}}.
   
-  HICC, the Chukwa visualization interface, {{{#Set+Up+the+Database}requires HBase 0.90.4+}}.
+  HICC, the Chukwa visualization interface, requires {{{http://hbase.apache.org}HBase 0.90.4+}}.
 
   The Chukwa cluster management scripts rely on <ssh>; these scripts, however,
 are not required if you have some alternate mechanism for starting and stopping
@@ -183,7 +183,7 @@ Hadoop configuration files.
 
   * Copy CHUKWA_HOME/etc/chukwa/hadoop-metrics2.properties file to HADOOP_CONF_DIR/hadoop-metrics2.properties
 
-  * Edit HADOOP_HOME/etc/hadoop/hadoop-metrics2.properties file and change ${CHUKWA_LOG_DIR}
to your actual CHUKWA log dirctory (ie, CHUKWA_HOME/var/log)
+  * Edit HADOOP_HOME/etc/hadoop/hadoop-metrics2.properties file and change $CHUKWA_LOG_DIR
to your actual CHUKWA log directory (i.e., CHUKWA_HOME/var/log)
 
 Setup HBase Table
 

Copied: incubator/chukwa/trunk/src/site/apt/dataflow.apt (from r1208953, incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml)
URL: http://svn.apache.org/viewvc/incubator/chukwa/trunk/src/site/apt/dataflow.apt?p2=incubator/chukwa/trunk/src/site/apt/dataflow.apt&p1=incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml&r1=1208953&r2=1210325&rev=1210325&view=diff
==============================================================================
--- incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/dataflow.xml (original)
+++ incubator/chukwa/trunk/src/site/apt/dataflow.apt Mon Dec  5 04:06:44 2011
@@ -1,38 +1,30 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one or more
-  contributor license agreements.  See the NOTICE file distributed with
-  this work for additional information regarding copyright ownership.
-  The ASF licenses this file to You under the Apache License, Version 2.0
-  (the "License"); you may not use this file except in compliance with
-  the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License.
--->
-<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
-"http://forrest.apache.org/dtd/document-v20.dtd">
-
-<document>
-  <header>
-    <title>Guide to Chukwa Storage Layout</title>
-  </header>
-  <body>
-
-<section><title>Overview</title>
-<p>This document describes how Chukwa data is stored in HDFS and the processes that
act on it.</p>
-</section>
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+~~
+
+Chukwa Storage Layout
+
+Overview
 
-<section><title>HDFS File System Structure</title>
+  This document describes how Chukwa data is stored in HDFS and the processes that act on
it.
 
-<p>The general layout of the Chukwa filesystem is as follows.</p>
+HDFS File System Structure
 
-<source>
+  The general layout of the Chukwa filesystem is as follows.
+
+---
 /chukwa/
    archivesProcessing/
    dataSinkArchives/
@@ -43,87 +35,84 @@
    repos/
    rolling/
    temp/
-</source>
-</section>
+---
+
+Raw Log Collection and Aggregation Workflow
+
+  What data is stored where is best described by stepping through the Chukwa workflow.
+
+  [[1]] Collectors write chunks to <logs/*.chukwa> files until a 64MB chunk size is
reached or a given time interval has passed.
+
+        * <logs/*.chukwa> 
+
+  [[2]] Collectors close chunks and rename them to <*.done>
+
+        * from <logs/*.chukwa>
+
+        * to <logs/*.done>
+
+  [[3]] DemuxManager checks for <*.done> files every 20 seconds.
+
+    [[1]] If <*.done> files exist, moves files in place for demux processing:
+
+           * from: <logs/*.done>
+
+           * to: <demuxProcessing/mrInput>
+
+    [[2]] The Demux MapReduce job is run on the data in <demuxProcessing/mrInput>.
+
+    [[3]] If demux is successful within 3 attempts, archives the completed files:
+
+           * from: <demuxProcessing/mrOutput>
+
+           * to: <dataSinkArchives/[yyyyMMdd]/*/*.done>
+
+    [[4]] Otherwise moves the completed files to an error folder:
+
+           * from: <demuxProcessing/mrOutput>
+
+           * to: <dataSinkArchives/InError/[yyyyMMdd]/*/*.done>
+
+  [[4]] PostProcessManager wakes up every few minutes and aggregates, orders and de-dups
record files.
+
+        * from: <postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt>
+
+        * to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt>
+
+  [[5]] HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs
to hourly.
+
+        * from: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt>
+
+        * to: <temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]>
+
+        * to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt>
+
+        * leaves: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/>
+
+  [[6]] DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily.
+
+        * from: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt>
+
+        * to: <temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]>
+
+        * to: <repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt>
+
+        * leaves: <repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/>
+
+  [[7]] ChukwaArchiveManager every half hour or so aggregates and removes dataSinkArchives
data using M/R.
+
+        * from: <dataSinkArchives/[yyyyMMdd]/*/*.done>
+
+        * to: <archivesProcessing/mrInput>
+
+        * to: <archivesProcessing/mrOutput>
+
+        * to: <finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*>
+
+Log Directories Requiring Cleanup
 
-<section><title>Raw Log Collection and Aggregation Workflow</title>
+  The following directories will grow over time and will need to be periodically pruned:
 
-<p>What data is stored where is best described by stepping through the Chukwa workflow.</p>
+  * <finalArchives/[yyyyMMdd]/*>
 
-<ol>
-<li>Collectors write chunks to <code>logs/*.chukwa</code> files until a
64MB chunk size is reached or a given time interval has passed.
-  <ul><li><code>logs/*.chukwa</code></li></ul> 
-</li>
-<li>Collectors close chunks and rename them to <code>*.done</code>
-<ul>
-<li>from <code>logs/*.chukwa</code></li>
-<li>to <code>logs/*.done</code></li>
-</ul>
-</li>
-<li>DemuxManager checks for <code>*.done</code> files every 20 seconds.
- <ol>
-  <li>If <code>*.done</code> files exist, moves files in place for demux
processing:
-   <ul>
-     <li>from: <code>logs/*.done</code></li>
-     <li>to: <code>demuxProcessing/mrInput</code></li>
-   </ul>
-  </li>
-  <li>The Demux MapReduce job is run on the data in <code>demuxProcessing/mrInput</code>.</li>
-  <li>If demux is successful within 3 attempts, archives the completed files:
-    <ul>
-     <li>from: <code>demuxProcessing/mrOutput</code></li>
-     <li>to: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code> </li>
-    </ul>
-  </li>
-  <li>Otherwise moves the completed files to an error folder:
-    <ul>
-     <li>from: <code>demuxProcessing/mrOutput</code></li>
-     <li>to: <code>dataSinkArchives/InError/[yyyyMMdd]/*/*.done</code>
</li>
-    </ul>
-   </li>
-  </ol>
-</li>
-<li>PostProcessManager wakes up every few minutes and aggregates, orders and de-dups
record files.
-  <ul><li>from: <code>postProcess/demuxOutputDir_*/[clusterName]/[dataType]/[dataType]_[yyyyMMdd]_[HH].R.evt</code></li>
-  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[HH]_[N].[N].evt</code></li>
-  </ul>
-</li>
-<li>HourlyChukwaRecordRolling runs M/R jobs at 16 past the hour to group 5 minute logs
to hourly.
-  <ul>
-  <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[mm]/[dataType]_[yyyyMMdd]_[mm].[N].evt</code></li>
-  <li>to: <code>temp/hourlyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
-  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_HourlyDone_[yyyyMMdd]_[HH].[N].evt</code></li>
-  <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/rotateDone/</code>
</li>
-  </ul>
-</li>
-<li>DailyChukwaRecordRolling runs M/R jobs at 1:30AM to group hourly logs to daily.
-  <ul>
-  <li>from: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[HH]/[dataType]_[yyyyMMdd]_[HH].[N].evt</code></li>
-  <li>to: <code>temp/dailyRolling/[clusterName]/[dataType]/[yyyyMMdd]</code></li>
-  <li>to: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/[dataType]_DailyDone_[yyyyMMdd].[N].evt</code></li>
-  <li>leaves: <code>repos/[clusterName]/[dataType]/[yyyyMMdd]/rotateDone/</code>
</li>
-  </ul>
-  </li> 
-<li>ChukwaArchiveManager every half hour or so aggregates and removes dataSinkArchives
data using M/R.
-  <ul>
-  <li>from: <code>dataSinkArchives/[yyyyMMdd]/*/*.done</code></li>
-  <li>to: <code>archivesProcessing/mrInput</code></li>
-  <li>to: <code>archivesProcessing/mrOutput</code></li>
-  <li>to: <code>finalArchives/[yyyyMMdd]/*/chukwaArchive-part-*</code>
</li>
-  </ul>
-  </li> 
- </ol>
- </section> 
-
-<section>
-<title>Log Directories Requiring Cleanup</title>
-
-<p>The following directories will grow over time and will need to be periodically pruned:</p>
-
-<ul>
-<li><code>finalArchives/[yyyyMMdd]/*</code></li>
-<li><code>repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt</code> </li>
-</ul>
-</section>
-</body>
-</document>
\ No newline at end of file
+  * <repos/[clusterName]/[dataType]/[yyyyMMdd]/*.evt>

Copied: incubator/chukwa/trunk/src/site/apt/programming.apt (from r1208953, incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml)
URL: http://svn.apache.org/viewvc/incubator/chukwa/trunk/src/site/apt/programming.apt?p2=incubator/chukwa/trunk/src/site/apt/programming.apt&p1=incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml&r1=1208953&r2=1210325&rev=1210325&view=diff
==============================================================================
--- incubator/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml (original)
+++ incubator/chukwa/trunk/src/site/apt/programming.apt Mon Dec  5 04:06:44 2011
@@ -1,263 +1,256 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!--
-  Licensed to the Apache Software Foundation (ASF) under one or more
-  contributor license agreements.  See the NOTICE file distributed with
-  this work for additional information regarding copyright ownership.
-  The ASF licenses this file to You under the Apache License, Version 2.0
-  (the "License"); you may not use this file except in compliance with
-  the License.  You may obtain a copy of the License at
-
-      http://www.apache.org/licenses/LICENSE-2.0
-
-  Unless required by applicable law or agreed to in writing, software
-  distributed under the License is distributed on an "AS IS" BASIS,
-  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-  See the License for the specific language governing permissions and
-  limitations under the License.
--->
-<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" 
-"http://forrest.apache.org/dtd/document-v20.dtd">
-
-<document>
-  <header>
-    <title>Chukwa User and Programming Guide</title>
-  </header>
-  <body>
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+~~
 
-<p>
-At the core of Chukwa is a flexible system for collecting and processing
+Chukwa User and Programming Guide
+
+  At the core of Chukwa is a flexible system for collecting and processing
 monitoring data, particularly log files. This document describes how to use the
 collected data.  (For an overview of the Chukwa data model and collection 
-pipeline, see the <a href="design.html">Design Guide</a>.)  
-</p>
+pipeline, see the {{{design.html}Design Guide}}.)  
 
-<p>
-In particular, this document discusses the Chukwa archive file formats, the
-demux and archiving mapreduce jobs, and  the layout of the Chukwa storage directories.</p>
+  In particular, this document discusses the Chukwa archive file formats, the
+demux and archiving mapreduce jobs, and  the layout of the Chukwa storage directories.
 
+Reading data from the sink or the archive
 
+  Chukwa gives you several ways of inspecting or processing collected data.
 
-<section>
-<title>Reading data from the sink or the archive</title>
-<p>
-Chukwa gives you several ways of inspecting or processing collected data.
-</p>
+* Dumping some data
 
-<section><title>Dumping some data</title>
-<p>
-It very often happens that you want to retrieve one or more files that have been
+  It very often happens that you want to retrieve one or more files that have been
 collected with Chukwa. If the total volume of data to be recovered is not too
-great, you can use <code>bin/chukwa dumpArchive</code>, a command-line tool that
does the job.
-The <code>dump</code> tool does an in-memory sort of the data, so you'll be 
+great, you can use <bin/chukwa dumpArchive>, a command-line tool that does the job.
+The <dump> tool does an in-memory sort of the data, so you'll be 
 constrained by the Java heap size (typically a few hundred MB).
-</p>
 
-<p>
-The <code>dump</code> tool takes a search pattern as its first argument, followed
+  The <dump> tool takes a search pattern as its first argument, followed
 by a list of files or file-globs.  It will then print the contents of every data
 stream in those files that matches the pattern. (A data stream is a sequence of
 chunks with the same host, source, and datatype.)  Data is printed in order,
 with duplicates removed.  No metadata is printed.  Separate streams are 
 separated by a row of dashes.  
-</p>
 
-<p>For example, the following command will dump all data from every file that
+  For example, the following command will dump all data from every file that
 matches the glob pattern.  Note the use of single quotes to pass glob patterns
-through to the application, preventing the shell from expanding them.</p>
-<source>
+through to the application, preventing the shell from expanding them.
+
+---
 $CHUKWA_HOME/bin/chukwa dumpArchive 'datatype=.*' 'hdfs://host:9000/chukwa/archive/*.arc'
-</source>
+---
 
-<p>
-The patterns used by <code>dump</code> are based on normal regular 
-expressions. They are of the form <code>field1=regex&#38;field2=regex</code>.
+  The patterns used by <dump> are based on normal regular 
+expressions. They are of the form <field1=regex&field2=regex>.
 That is, they are a sequence of rules, separated by ampersand signs. Each rule
-is of the form <code>metadatafield=regex</code>, where 
-<code>metadatafield</code> is one of the Chukwa metadata fields, and 
-<code>regex</code> is a regular expression.  The valid metadata field names are:
-<code>datatype</code>, <code>host</code>, <code>cluster</code>,

-<code>content</code>, <code>name</code>.  Note that the <code>name</code>
field matches the stream name -- often the filename
+is of the form <metadatafield=regex>, where 
+<metadatafield> is one of the Chukwa metadata fields, and 
+<regex> is a regular expression.  The valid metadata field names are:
+<datatype>, <host>, <cluster>, 
+<content>, <name>.  Note that the <name> field matches the stream name
-- often the filename
 that the data was extracted from.
-</p>
 
-<p>
-In addition, you can match arbitrary tags via <code>tags.tagname</code>.
-So for instance, to match chunks with tag <code>foo="bar"</code> you could say
-<code>tags.foo=bar</code>. Note that quotes are present in the tag, but not
-in the filter rule.</p>
+  In addition, you can match arbitrary tags via <tags.tagname>.
+So for instance, to match chunks with tag <foo="bar"> you could say
+<tags.foo=bar>. Note that quotes are present in the tag, but not
+in the filter rule.
 
-<p>A stream matches the search pattern only if every rule matches. So to 
+  A stream matches the search pattern only if every rule matches. So to 
 retrieve HadoopLog data from cluster foo, you might search for 
-<code>cluster=foo&#38;datatype=HadoopLog</code>.
-</p>
-</section>
+<cluster=foo&datatype=HadoopLog>.
 
+* Exploring the Sink or Archive
 
-<section><title>Exploring the Sink or Archive</title>
-<p>
-Another common task is finding out what data has been collected. Chukwa offers
-a specialized tool for this purpose: <code>DumpArchive</code>. This tool has
+  Another common task is finding out what data has been collected. Chukwa offers
+a specialized tool for this purpose: <DumpArchive>. This tool has
 two modes: summarize and verbose, with the latter being the default.
-</p>
-<p>
-In summarize mode, <code>DumpArchive</code> prints a count of chunks in each
-data stream.  In verbose mode, the chunks themselves are dumped.</p>
-<p>
-You can invoke the tool by running <code>$CHUKWA_HOME/bin/dumpArchive.sh</code>.
-To specify summarize mode, pass <code>--summarize</code> as the first argument.
-</p>
-<source>
+
+  In summarize mode, <DumpArchive> prints a count of chunks in each
+data stream.  In verbose mode, the chunks themselves are dumped.
+
+  You can invoke the tool by running <$CHUKWA_HOME/bin/dumpArchive.sh>.
+To specify summarize mode, pass <--summarize> as the first argument.
+
+---
 bin/chukwa dumpArchive --summarize 'hdfs://host:9000/chukwa/logs/*.done'
-</source>
-</section>
+---
+
+* Using MapReduce
 
-<section><title>Using MapReduce</title>
-<p>
-A key goal of Chukwa was to facilitate MapReduce processing of collected data.
+  A key goal of Chukwa was to facilitate MapReduce processing of collected data.
 The next section discusses the file formats.  An understanding of MapReduce
-and SequenceFiles is helpful in understanding the material.</p>
-</section>
+and SequenceFiles is helpful in understanding the material.
 
-</section>
+Sink File Format
 
-<section>
-<title>Sink File Format</title>
-<p>
-As data is collected, Chukwa dumps it into <em>sink files</em> in HDFS. By
-default, these are located in <code>hdfs:///chukwa/logs</code>.  If the file
name 
+  As data is collected, Chukwa dumps it into <sink files> in HDFS. By
+default, these are located in <hdfs:///chukwa/logs>.  If the file name 
 ends in .chukwa, that means the file is still being written to. Every few minutes, 
 the collector will close the file, and rename the file to '*.done'.  This 
-marks the file as available for processing.</p>
+marks the file as available for processing.
 
-<p>
-Each sink file is a Hadoop sequence file, containing a succession of 
+  Each sink file is a Hadoop sequence file, containing a succession of 
 key-value pairs, and periodic synch markers to facilitate MapReduce access. 
-They key type is <code>ChukwaArchiveKey</code>; the value type is 
-<code>ChunkImpl</code>. See the Chukwa Javadoc for details about these classes.
-</p>
-
-<p>Data in the sink may include duplicate and omitted chunks.</p>
-</section>
-
-<section>
-<title>Demux and Archiving</title>
-<p>It's possible to write MapReduce jobs that directly examine the data sink, 
+The key type is <ChukwaArchiveKey>; the value type is 
+<ChunkImpl>. See the Chukwa Javadoc for details about these classes.
+
+  Data in the sink may include duplicate and omitted chunks.
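  As an illustration only (not part of this commit), a closed sink file can be read
directly with the Hadoop SequenceFile API.  The ChunkImpl accessors used below
(getBlankChunk(), getSource(), getDataType(), getData()) are assumptions based on the
Chunk interface; check the Chukwa Javadoc before relying on them.

---
import org.apache.hadoop.chukwa.ChukwaArchiveKey;
import org.apache.hadoop.chukwa.ChunkImpl;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SinkFileDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // e.g. hdfs://host:9000/chukwa/logs/xyz.done
    Path sinkFile = new Path(args[0]);
    FileSystem fs = sinkFile.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, sinkFile, conf);
    try {
      ChukwaArchiveKey key = new ChukwaArchiveKey();
      ChunkImpl chunk = ChunkImpl.getBlankChunk();   // assumed factory method
      while (reader.next(key, chunk)) {
        // One line of metadata per chunk; duplicates may appear, as noted above.
        System.out.println(chunk.getSource() + " / " + chunk.getDataType()
            + " : " + chunk.getData().length + " bytes");
      }
    } finally {
      reader.close();
    }
  }
}
---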
+
+Demux and Archiving
+
+  It's possible to write MapReduce jobs that directly examine the data sink, 
 but it's not extremely convenient. Data is not organized in a useful way, so 
 jobs will likely discard most of their input. Data quality is imperfect, since 
 duplicates and omissions may exist.  And MapReduce and HDFS are optimized to 
-deal with a modest number of large files, not many small ones.</p> 
+deal with a modest number of large files, not many small ones.
 
-<p> Chukwa therefore supplies several MapReduce jobs for organizing collected 
+  Chukwa therefore supplies several MapReduce jobs for organizing collected 
 data and putting it into a more useful form; these jobs are typically run 
 regularly from cron.  Knowing how to use Chukwa-collected data requires 
 understanding how these jobs lay out storage. For now, this document only 
-discusses one such job: the Simple Archiver. </p>
-</section>
+discusses one such job: the Simple Archiver.
 
-<section><title>Simple Archiver</title>
-<p>The simple archiver is designed to consolidate a large number of data sink 
+Simple Archiver
+
+  The simple archiver is designed to consolidate a large number of data sink 
 files into a small number of archive files, with the contents grouped in a 
 useful way.  Archive files, like raw sink files, are in Hadoop sequence file 
 format. Unlike the data sink, however, duplicates have been removed.  (Future 
-versions of the Simple Archiver will indicate the presence of gaps.)</p>
+versions of the Simple Archiver will indicate the presence of gaps.)
 
-<p>The simple archiver moves every <code>.done</code> file out of the sink,
and 
+  The simple archiver moves every <.done> file out of the sink, and 
 then runs a MapReduce job to group the data. Output Chunks will be placed into 
 files with names of the form 
-<code>hdfs:///chukwa/archive/clustername/Datatype_date.arc</code>.  
+<hdfs:///chukwa/archive/clustername/Datatype_date.arc>.  
 Date corresponds to when the data was collected; Datatype is the datatype of 
 each Chunk. 
-</p>
-
-<p>If archived data corresponds to an existing filename, a new file will be 
-created with a disambiguating suffix.</p>
-
-<!-- The Simple Archiver is a Java class, stored in <code>chukwa-core-*.jar</code>
--->
 
-</section>
+  If archived data corresponds to an existing filename, a new file will be 
+created with a disambiguating suffix.
 
+Demux
 
-<section><title>Demux</title>
-
-<p>A key use for Chukwa is processing arriving data, in parallel, using MapReduce.
+  A key use for Chukwa is processing arriving data, in parallel, using MapReduce.
 The most common way to do this is using the Chukwa demux framework.
-As <a href="dataflow.html">data flows through Chukwa</a>, the demux job is often
the
+As {{{dataflow.html}data flows through Chukwa}}, the demux job is often the
 first job that runs.
-</p>
 
-<p>By default, Chukwa will use the default TsProcessor. This parser will try to
- extract the real log statement from the log entry using the ISO8601 date 
- format. If it fails, it will use the time at which the chunk was written to
- disk (collector timestamp).</p>
-
-<section>
-<title>Writing a custom demux Mapper</title>
-
-<p>If you want to extract some specific information and perform more processing you
- need to write your own parser. Like any M/R program, your have to write at least
- the Map side for your parser. The reduce side is Identity by default.</p>
-
-<p>On the Map side,you can write your own parser from scratch or extend the AbstractProcessor
class
- that hides all the low level action on the chunk. See
- <code>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df</code> for
an example
- of a Map class for use with Demux.
- </p>
+  By default, Chukwa will use the default TsProcessor. This parser will try to
+extract the real log statement from the log entry using the ISO8601 date 
+format. If it fails, it will use the time at which the chunk was written to
+disk (collector timestamp).
+
+* Writing a custom demux Mapper
+
+  If you want to extract some specific information and perform more processing you
+need to write your own parser. Like any M/R program, you have to write at least
+the Map side for your parser. The reduce side is Identity by default.
+
+  On the Map side, you can write your own parser from scratch or extend the AbstractProcessor
class
+that hides all the low level action on the chunk. See
+<org.apache.hadoop.chukwa.extraction.demux.processor.mapper.Df> for an example
+of a Map class for use with Demux.
  
-<p>For Chukwa to invoke your Mapper code, you have
- to specify which data types it should run on.
- Edit <code>${CHUKWA_HOME}/conf/chukwa-demux-conf.xml</code> and add the following
lines:
- </p>
-<source>
-      &#60;property&#62;
-            &#60;name&#62;MyDataType&#60;/name&#62; 
-            &#60;value&#62;org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyParser&#60;/value&#62;
-            &#60;description&#62;Parser class for MyDataType.&#60;/description&#62;
-      &#60;/property&#62;
-</source>
-<p>You can use the same parser for several different recordTypes.</p>
-</section>
-
-<section><title>Writing a custom reduce</title>
-
-<p>You only need to implement a reduce side if you need to group records together.

-The interface that your need to implement is <code>ReduceProcessor</code>:
-</p>
-<source>
+  For Chukwa to invoke your Mapper code, you have
+to specify which data types it should run on.
+Edit <${CHUKWA_HOME}/etc/chukwa/chukwa-demux-conf.xml> and add the following lines:
+
+---
+<property>
+    <name>MyDataType</name>
+    <value>org.apache.hadoop.chukwa.extraction.demux.processor.mapper.MyParser</value>
+    <description>Parser class for MyDataType.</description>
+</property>
+---
+
+  You can use the same parser for several different recordTypes.
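  As noted above, a Map-side parser typically extends AbstractProcessor.  A rough sketch
follows (not part of this commit); <MyParser>, the field name and the import paths are
hypothetical, while the parse() signature and the buildGenericRecord() call mirror the
HBase example later in this document.

---
package org.apache.hadoop.chukwa.extraction.demux.processor.mapper;

import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyParser extends AbstractProcessor {
  @Override
  protected void parse(String recordEntry,
      OutputCollector<ChukwaRecordKey, ChukwaRecord> output, Reporter reporter)
      throws Throwable {
    ChukwaRecord record = new ChukwaRecord();
    // Hypothetical parsing: keep the whole log line in a single field.
    record.add("body", recordEntry);
    // Fill in key and record metadata; same helper as in the HBase example below.
    buildGenericRecord(record, recordEntry, System.currentTimeMillis(), "MyDataType");
    output.collect(key, record);
  }
}
---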
+
+* Writing a custom reduce
+
+  You only need to implement a reduce side if you need to group records together. 
+The interface that you need to implement is <ReduceProcessor>:
+
+---
 public interface ReduceProcessor
 {
            public String getDataType();
-           public void process(ChukwaRecordKey key,Iterator&#60;ChukwaRecord&#62;
values,
-                      OutputCollector&#60;ChukwaRecordKey, 
-                      ChukwaRecord&#62; output, Reporter reporter);
+           public void process(ChukwaRecordKey key,Iterator<ChukwaRecord> values,
+                               OutputCollector<ChukwaRecordKey, ChukwaRecord> output,

+                               Reporter reporter);
 }
-</source>
+---
+
+  The link between the Map side and the reduce is done by setting your reduce class
+into the reduce type: <key.setReduceType("MyReduceClass");>
+Note that in the current version of Chukwa, your class needs to be in the package
+<org.apache.hadoop.chukwa.extraction.demux.processor>
+See <org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics>
+for an example of a Demux reducer.
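  For illustration only (not part of this commit), a minimal reducer-side parser
implementing the <ReduceProcessor> interface above might look like the sketch below;
<MyReduceClass> and its pass-through logic are hypothetical, and the package is assumed
to match the SystemMetrics reducer example.

---
package org.apache.hadoop.chukwa.extraction.demux.processor.reducer;

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecord;
import org.apache.hadoop.chukwa.extraction.engine.ChukwaRecordKey;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MyReduceClass implements ReduceProcessor {

  public String getDataType() {
    return "MyDataType";
  }

  public void process(ChukwaRecordKey key, Iterator<ChukwaRecord> values,
                      OutputCollector<ChukwaRecordKey, ChukwaRecord> output,
                      Reporter reporter) {
    try {
      // Hypothetical pass-through: emit each grouped record unchanged.  A real
      // reducer would merge the values into a single record here.
      while (values.hasNext()) {
        output.collect(key, values.next());
      }
    } catch (IOException e) {
      reporter.setStatus("MyReduceClass failed: " + e.getMessage());
    }
  }
}
---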
 
-<p>The link between the Map side and the reduce is done by setting your reduce class
- into the reduce type: <code>key.setReduceType("MyReduceClass");</code>.
- Note that in the current version of Chukwa, your class needs to be in the package
- <code>org.apache.hadoop.chukwa.extraction.demux.processor</code>
-See <code>org.apache.hadoop.chukwa.extraction.demux.processor.reducer.SystemMetrics</code>
-for an example of a Demux reducer.</p>
-</section>
-
-<section>
-<title>Output</title>
-<p> Your data is going to be sorted by RecordType then by the key field. The default
- implementation use the following grouping for all records:</p>
-<ol>
-<li>Time partition (Time up to the hour)</li>
-<li>Machine name (physical input source)</li>
-<li>Record timestamp </li>
-</ol>
+* Output
 
-<p>The demux process will use the recordType to save similar records together 
+  Your data is going to be sorted by RecordType then by the key field. The default
+implementation uses the following grouping for all records:
+
+  * Time partition (Time up to the hour)
+
+  * Machine name (physical input source)
+
+  * Record timestamp
+
+  The demux process will use the recordType to save similar records together 
 (same recordType) to the same directory: 
-<code>&#62;cluster name&#62;/&#60;record type&#62;/</code>
-</p></section>
 
-</section>
+---
+<cluster name>/<record type>/
+---
+
+* Demux Data To HBase
+
+  Demux parsers can be configured to run in Chukwa Collector.  See 
+{{{./collector.html}Collector configuration guide}}.  HBaseWriter is not a
+real map reduce job.  It is designed to reuse Demux parsers for extraction,
+transformation and load purposes.  There are some limitations to consider before
+implementing a Demux parser for loading data into HBase.  In a MapReduce job, multiple
+values can be merged and grouped into a key/value pair in the shuffle/combine and merge
+phases.  This kind of aggregation is unsupported by Demux in HBaseWriter because the data
+is not merged in memory, but sent to HBase.  HBase takes the role of merging values into a
+record by primary key.  Therefore, the Demux reducer parser is not invoked by HBaseWriter.
+
+  To write a demux parser that works with HBaseWriter, there are two pieces of information
+to encode in the Demux parser.  First, the HBase table name in which to store the data;
+this is encoded in the Demux parser by an annotation.  Second, the column family name in
+which to store the data; this is encoded in the ReducerType of the Demux Reducer parser.
+
+** Example of Demux mapper parser
+
+---
+@Tables(annotations={
+    @Table(name="SystemMetrics",columnFamily="cpu")
+})
+public class SystemMetrics extends AbstractProcessor {
+  @Override
+  protected void parse(String recordEntry,
+      OutputCollector<ChukwaRecordKey, ChukwaRecord> output, Reporter reporter)
+      throws Throwable {
+    ...
+    buildGenericRecord(record, null, cal.getTimeInMillis(), "cpu");
+    output.collect(key, record);
+  }
+}
+---
 
+  In this example, the data collected by the SystemMetrics parser is stored in the <"SystemMetrics">
+HBase table, in the <"cpu"> column family.
 
-</body>
-</document>
\ No newline at end of file


