chukwa-commits mailing list archives

From: asrab...@apache.org
Subject: svn commit: r800642 - in /hadoop/chukwa/trunk: CHANGES.txt bin/dumpArchive.sh src/docs/src/documentation/content/xdocs/programming.xml src/java/org/apache/hadoop/chukwa/util/DumpArchive.java
Date: Tue, 04 Aug 2009 00:36:42 GMT
Author: asrabkin
Date: Tue Aug  4 00:36:42 2009
New Revision: 800642

URL: http://svn.apache.org/viewvc?rev=800642&view=rev
Log:
CHUKWA-365.  Improved DumpArchive tool

Modified:
    hadoop/chukwa/trunk/CHANGES.txt
    hadoop/chukwa/trunk/bin/dumpArchive.sh
    hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
    hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/util/DumpArchive.java

Modified: hadoop/chukwa/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/CHANGES.txt?rev=800642&r1=800641&r2=800642&view=diff
==============================================================================
--- hadoop/chukwa/trunk/CHANGES.txt (original)
+++ hadoop/chukwa/trunk/CHANGES.txt Tue Aug  4 00:36:42 2009
@@ -44,6 +44,8 @@
 
   IMPROVEMENTS
 
+    CHUKWA-365.  Improved DumpArchive tool. (asrabkin)
+
     CHUKWA-364.  Design and Architecture document in documentation. (asrabkin)
 
     CHUKWA-333.  Copy release notes from 0.2 forward to Trunk. (asrabkin)

Modified: hadoop/chukwa/trunk/bin/dumpArchive.sh
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/bin/dumpArchive.sh?rev=800642&r1=800641&r2=800642&view=diff
==============================================================================
--- hadoop/chukwa/trunk/bin/dumpArchive.sh (original)
+++ hadoop/chukwa/trunk/bin/dumpArchive.sh Tue Aug  4 00:36:42 2009
@@ -21,4 +21,4 @@
 
 . "$bin"/chukwa-config.sh
 
-${JAVA_HOME}/bin/java -Djava.library.path=${JAVA_LIBRARY_PATH} -DCHUKWA_CONF_DIR=${CHUKWA_CONF_DIR} -classpath ${CHUKWA_CONF_DIR}:${HADOOP_CONF_DIR}:${CLASSPATH}:${CHUKWA_CORE}:${COMMON}:${HADOOP_JAR} org.apache.hadoop.chukwa.util.DumpArchive $1
+${JAVA_HOME}/bin/java -Djava.library.path=${JAVA_LIBRARY_PATH} -DCHUKWA_CONF_DIR=${CHUKWA_CONF_DIR} -classpath ${CHUKWA_CONF_DIR}:${HADOOP_CONF_DIR}:${CLASSPATH}:${CHUKWA_CORE}:${COMMON}:${HADOOP_JAR} org.apache.hadoop.chukwa.util.DumpArchive $@
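
For illustration, forwarding "$@" instead of "$1" means the wrapper now passes
every command-line argument through to DumpArchive, so a flag plus several
file globs can be given in a single invocation. A minimal sketch (the host
and paths are placeholders):

    # hypothetical invocation; single quotes keep the shell from expanding the globs
    bin/dumpArchive.sh --summarize 'hdfs://host:9000/chukwa/logs/*.done' \
        'hdfs://host:9000/chukwa/archive/*.arc'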

Modified: hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml?rev=800642&r1=800641&r2=800642&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml (original)
+++ hadoop/chukwa/trunk/src/docs/src/documentation/content/xdocs/programming.xml Tue Aug  4 00:36:42 2009
@@ -35,12 +35,90 @@
 In particular, this document discusses the Chukwa archive file formats, and 
 the layout of the Chukwa storage directories.</p>
 
+
+
+<section>
+<title>Reading data from the sink or the archive</title>
+<p>
+Chukwa gives you several ways of inspecting or processing collected data.
+</p>
+
+<section><title>Dumping some data</title>
+<p>
+It very often happens that you want to retrieve one or more files that have been
+collected with Chukwa. If the total volume of data to be recovered is not too
+great, you can use <code>dump.sh</code>, a command-line tool that does the job.
+The <code>dump</code> tool does an in-memory sort of the data, so you'll be 
+constrained by the Java heap size (typically a few hundred MB).
+</p>
+
+<p>
+The <code>dump</code> tool takes a search pattern as its first argument, followed
+by a list of files or file-globs.  It will then print the contents of every data
+stream in those files that matches the pattern. (A data stream is a sequence of
+chunks with the same host, source, and datatype.)  Data is printed in order,
+with duplicates removed.  No metadata is printed.  Separate streams are 
+separated by a row of dashes.  
+</p>
+
+<p>For example, the following command will dump all data from every file that
+matches the glob pattern.  Note the use of single quotes to pass glob patterns
+through to the application, preventing the shell from expanding them.</p>
+<source>
+$CHUKWA_HOME/bin/dump.sh 'datatype=.*' 'hdfs://host:9000/chukwa/archive/*.arc'
+</source>
+
+<p>
+The patterns used by <code>dump.sh</code> are based on normal regular 
+expressions. They are of the form <code>field1=regex&#38;field2=regex</code>.
+That is, they are a sequence of rules, separated by ampersand signs. Each rule
+is of the form <code>metadatafield=regex</code>, where 
+<code>metadatafield</code> is one of the Chukwa metadata fields, and 
+<code>regex</code> is a regular expression.  The valid metadata field names are:
+<code>datatype</code>, <code>host</code>, <code>cluster</code>,
+<code>content</code> and <code>name</code>.  
+</p>
+
+<p>A stream matches the search pattern only if every rule matches. So to 
+retrieve HadoopLog data from cluster foo, you might search for 
+<code>cluster=foo&#38;datatype=HadoopLog</code>.
+</p>
+</section>
+
+
+<section><title>Exploring the Sink or Archive</title>
+<p>
+Another common task is finding out what data has been collected. Chukwa offers
+a specialized tool for this purpose: <code>DumpArchive</code>. This tool has
+two modes: summarize and verbose, with the latter being the default.
+</p>
+<p>
+In summarize mode, <code>DumpArchive</code> prints a count of chunks in each
+data stream.  In verbose mode, the chunks themselves are dumped.</p>
+<p>
+You can invoke the tool by running <code>$CHUKWA_HOME/bin/dumpArchive.sh</code>.
+To specify summarize mode, pass <code>--summarize</code> as the first argument.
+</p>
+<source>
+bin/dumpArchive.sh --summarize 'hdfs://host:9000/chukwa/logs/*.done'
+</source>
+</section>
+
+<section><title>Using MapReduce</title>
+<p>
+A key goal of Chukwa is to facilitate MapReduce processing of collected data.
+The next section discusses the file formats.  Familiarity with MapReduce
+and SequenceFiles is helpful in understanding the material.</p>
+</section>
+
+</section>
+
 <section>
 <title>Sink File Format</title>
 <p>
 As data is collected, Chukwa dumps it into <em>sink files</em> in HDFS. By
-default, these are located in <code>hdfs:///chukwa/logs</code>.  If the file name ends
-in .chukwa, that means the file is still being written to. Every few minutes, 
+default, these are located in <code>hdfs:///chukwa/logs</code>.  If the file name 
+ends in .chukwa, that means the file is still being written to. Every few minutes, 
 the collector will close the file, and rename the file to '*.done'.  This 
 marks the file as available for processing.</p>
 
@@ -92,48 +170,6 @@
 
 </section>
 
-<section>
-<title>Reading data from the sink or the archive</title>
-<p>
-It very often happens that you want to retrieve one or more files that have been
-collected with Chukwa. If the total volume of data to be recovered is not too
-great, you can use <code>dump.sh</code>, a command-line tool that does the job.
-</p>
-
-<p>
-The <code>dump</code> tool takes a search pattern as its first argument, followed
-by a list of files or file-globs.  It will then print the contents of every data
-stream in those files that matches the pattern. (A data stream is a sequence of
-chunks with the same host, source, and datatype.)  Data is printed in order,
-with duplicates removed.  No metadata is printed.  Separate streams are 
-separated by a row of dashes.  
-</p>
-
-<p>For example, the following command will dump all data from every file that
-matches the glob pattern.  Note the use of single quotes to pass glob patterns
-through to the application, preventing the shell from expanding them.</p>
-<source>
-$CHUKWA_HOME/bin/dump.sh 'datatype=.*' 'hdfs://host:9000/chukwa/archive/*.arc'
-</source>
-
-<p>
-The patterns used by <code>dump.sh</code> are based on normal regular 
-expressions. They are of the form <code>field1=regex&#38;field2=regex</code>.
-That is, they are a sequence of rules, separated by ampersand signs. Each rule
-is of the form <code>metadatafield=regex</code>, where 
-<code>metadatafield</code> is one of the Chukwa metadata fields, and 
-<code>regex</code> is a regular expression.  The valid metadata field names are:
-<code>datatype</code>, <code>host</code>, <code>cluster</code>,
-<code>content</code> and <code>name</code>.  
-</p>
-
-<p>A stream matches the search pattern only if every rule matches. So to 
-retrieve HadoopLog data from cluster foo, you might search for 
-<code>cluster=foo&#38;datatype=HadoopLog</code>.
-</p>
-
-
-</section>
 
 </body>
 </document>
\ No newline at end of file
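
As a worked example of the pattern syntax documented above (rules joined by
ampersands, each rule matching one metadata field), the following sketch
retrieves HadoopLog data from cluster foo; the host and archive path are
placeholders:

    # pattern first, then one or more file globs, quoted to defeat shell expansion
    $CHUKWA_HOME/bin/dump.sh 'cluster=foo&datatype=HadoopLog' \
        'hdfs://host:9000/chukwa/archive/*.arc'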

Modified: hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/util/DumpArchive.java
URL: http://svn.apache.org/viewvc/hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/util/DumpArchive.java?rev=800642&r1=800641&r2=800642&view=diff
==============================================================================
--- hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/util/DumpArchive.java (original)
+++ hadoop/chukwa/trunk/src/java/org/apache/hadoop/chukwa/util/DumpArchive.java Tue Aug  4 00:36:42 2009
@@ -1,53 +1,134 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
 package org.apache.hadoop.chukwa.util;
 
 
 import java.io.IOException;
 import java.net.URI;
 import java.net.URISyntaxException;
+import java.util.*;
 import org.apache.hadoop.chukwa.ChukwaArchiveKey;
 import org.apache.hadoop.chukwa.ChunkImpl;
 import org.apache.hadoop.chukwa.conf.ChukwaConfiguration;
 import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.FileUtil;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.SequenceFile;
+import org.apache.hadoop.conf.Configuration;
 
+/**
+ * Tool for exploring the contents of the Chukwa data archive, or a collection
+ * of Chukwa sequence files.
+ * 
+ * Limitation: DumpArchive infers the filesystem to dump from based on the first
+ * path argument, and will behave strangely if you try to dump files
+ * from different filesystems in the same invocation.
+ *
+ */
 public class DumpArchive {
 
+  static boolean summarize = false;
+  
+  static HashMap<String, Integer> counts  = new LinkedHashMap<String, Integer>();
   /**
    * @param args
    * @throws URISyntaxException
    * @throws IOException
    */
   public static void main(String[] args) throws IOException, URISyntaxException {
-    System.out.println("Input file:" + args[0]);
 
+    int firstArg = 0;
+    if(args.length == 0) {
+      System.out.println("Usage: DumpArchive [--summarize] <sequence files>");
+      return;  // bail out; otherwise args[0] below throws on an empty array
+    }
+    if(args[0].equals("--summarize")) {
+      firstArg = 1;
+      summarize = true;
+    }
     ChukwaConfiguration conf = new ChukwaConfiguration();
-    String fsName = conf.get("writer.hdfs.filesystem");
-    FileSystem fs = FileSystem.get(new URI(fsName), conf);
+    FileSystem fs;
+    if(args[firstArg].contains("://")) {
+      fs = FileSystem.get(new URI(args[firstArg]), conf);
+    } else {
+      String fsName = conf.get("writer.hdfs.filesystem");
+      if(fsName != null)
+        fs = FileSystem.get(conf);
+      else
+        fs = FileSystem.getLocal(conf);
+    }
+    ArrayList<Path> filesToSearch = new ArrayList<Path>();
+    for(int i=firstArg; i < args.length; ++i){
+      Path[] globbedPaths = FileUtil.stat2Paths(fs.globStatus(new Path(args[i])));
+      for(Path p: globbedPaths)
+        filesToSearch.add(p);
+    }
+    int tot = filesToSearch.size();
+    int i=1;
+
+    System.err.println("total of " + tot + " files to search");
+    for(Path p: filesToSearch) {
+      System.err.println("scanning " + p.toUri() + "("+ (i++) +"/"+tot+")");
+      dumpFile(p, conf, fs);
+    }
 
-    SequenceFile.Reader r = new SequenceFile.Reader(fs, new Path(args[0]), conf);
+    if(summarize) {
+      for(Map.Entry<String, Integer> count: counts.entrySet()) {
+        System.out.println(count.getKey()+ ")   ===> " + count.getValue());
+      }
+    }
+  }
+
+  private static void dumpFile(Path p, Configuration conf,
+      FileSystem fs) throws IOException {
+    SequenceFile.Reader r = new SequenceFile.Reader(fs, p, conf);
 
     ChukwaArchiveKey key = new ChukwaArchiveKey();
     ChunkImpl chunk = ChunkImpl.getBlankChunk();
     try {
       while (r.next(key, chunk)) {
-        System.out.println("\nTimePartition: " + key.getTimePartition());
-        System.out.println("DataType: " + key.getDataType());
-        System.out.println("StreamName: " + key.getStreamName());
-        System.out.println("SeqId: " + key.getSeqId());
-        System.out.println("\t\t =============== ");
-
-        System.out.println("Cluster : " + chunk.getTags());
-        System.out.println("DataType : " + chunk.getDataType());
-        System.out.println("Source : " + chunk.getSource());
-        System.out.println("Application : " + chunk.getApplication());
-        System.out.println("SeqID : " + chunk.getSeqID());
-        System.out.println("Data : " + new String(chunk.getData()));
+        
+        String entryKey = chunk.getSource() +":"+chunk.getDataType() +":" +
+        chunk.getApplication();
+        
+        Integer oldC = counts.get(entryKey);
+        if(oldC != null)
+          counts.put(entryKey, oldC + 1);
+        else
+          counts.put(entryKey, new Integer(1));
+        
+        if(!summarize) {
+          System.out.println("\nTimePartition: " + key.getTimePartition());
+          System.out.println("DataType: " + key.getDataType());
+          System.out.println("StreamName: " + key.getStreamName());
+          System.out.println("SeqId: " + key.getSeqId());
+          System.out.println("\t\t =============== ");
+  
+          System.out.println("Cluster : " + chunk.getTags());
+          System.out.println("DataType : " + chunk.getDataType());
+          System.out.println("Source : " + chunk.getSource());
+          System.out.println("Application : " + chunk.getApplication());
+          System.out.println("SeqID : " + chunk.getSeqID());
+          System.out.println("Data : " + new String(chunk.getData()));
+        }
       }
     } catch (Exception e) {
       e.printStackTrace();
     }
-
   }
 
 }
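
For a sense of what summarize mode produces, given the code above: progress
messages go to stderr, and each source:datatype:application stream gets one
count line on stdout, in the format printed from counts. A hypothetical run
might look like (hosts, paths, and counts invented for illustration):

    bin/dumpArchive.sh --summarize 'hdfs://host:9000/chukwa/logs/*.done'
    # stderr: total of 2 files to search
    # stderr: scanning hdfs://host:9000/chukwa/logs/x.done(1/2)
    # stderr: scanning hdfs://host:9000/chukwa/logs/y.done(2/2)
    # stdout:
    # host1:HadoopLog:namenode)   ===> 42
    # host2:SysLog:cron)   ===> 17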


