hadoop-hdfs-commits mailing list archives

From a..@apache.org
Subject svn commit: r1440245 [1/2] - in /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src: main/docs/src/documentation/content/xdocs/ site/apt/
Date Wed, 30 Jan 2013 01:52:15 GMT
Author: atm
Date: Wed Jan 30 01:52:14 2013
New Revision: 1440245

URL: http://svn.apache.org/viewvc?rev=1440245&view=rev
Log:
HADOOP-9221. Convert remaining xdocs to APT. Contributed by Andy Isaacson.

Added:
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/SLGUserGuide.apt.vm
Removed:
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/SLG_user_guide.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/faultinject_framework.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_editsviewer.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_imageviewer.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_permissions_guide.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_quota_admin_guide.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hdfs_user_guide.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/hftp.xml
    hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/docs/src/documentation/content/xdocs/libhdfs.xml

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/FaultInjectFramework.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,312 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  Fault Injection Framework and Development Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+Fault Injection Framework and Development Guide
+
+%{toc|section=1|fromDepth=0}
+
+* Introduction
+
+   This guide provides an overview of the Hadoop Fault Injection (FI)
+   framework for those who will be developing their own faults (aspects).
+
+   The idea of fault injection is fairly simple: errors and exceptions
+   are infused into an application's logic in order to achieve higher
+   test coverage and to exercise the fault tolerance of the system.
+   Different implementations of this idea are available today. Hadoop's
+   FI framework is built on top of the Aspect-Oriented Programming (AOP)
+   paradigm, as implemented by the AspectJ toolkit.
+
+* Assumptions
+
+   The current implementation of the FI framework assumes that the
+   faults it emulates are non-deterministic in nature. That is, the
+   moment at which a fault occurs is not known in advance; it is decided
+   by a (biased) coin flip.
+
+* Architecture of the Fault Injection Framework
+
+   Components layout
+
+** Configuration Management
+
+   This part of the FI framework lets you set expectations for faults
+   to happen. The settings can be applied either statically (in advance)
+   or at runtime. The desired level of faults in the framework can be
+   configured in two ways:
+
+     * by editing the src/aop/fi-site.xml configuration file, which is
+       similar to other Hadoop configuration files
+
+     * by setting JVM system properties through VM startup parameters or
+       in the build.properties file
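As an illustration, a fault level set statically in src/aop/fi-site.xml would follow the standard Hadoop name/value property format; the property name below is taken from the examples later in this guide, and the 0.12 level is just a sample value:

```xml
<configuration>
  <property>
    <!-- probability level for the BlockReceiver fault; 0.12 is a sample value -->
    <name>fi.hdfs.datanode.BlockReceiver</name>
    <value>0.12</value>
  </property>
</configuration>
```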
+
+** Probability Model
+
+   This is fundamentally a coin flipper. The methods of this class draw
+   a random number between 0.0 and 1.0 and check whether it falls
+   between 0.0 and the configured level for the fault in question. If it
+   does, the fault occurs.
+
+   Thus, to guarantee that a fault happens, set its level to 1.0. To
+   completely prevent a fault from happening, set its level to 0.0.
+
+   Note: The default probability level is 0 (zero) unless the level is
+   changed explicitly through the configuration file or at runtime. The
+   name of the default level's configuration parameter is fi.*
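The coin-flip check described above can be sketched in plain Java as follows; the real ProbabilityModel reads the level from the fi.* configuration, so the class and method names here are an illustrative stand-in only:

```java
import java.util.Random;

// Minimal sketch of the probability model's coin flip. The real class
// looks the level up from configuration; here it is passed in directly.
public class CoinFlipSketch {
  private static final Random RANDOM = new Random();

  // level is the configured probability for the fault, between 0.0 and 1.0
  public static boolean injectCriteria(double level) {
    // draw a number in [0.0, 1.0) and fire the fault if it falls below level
    return RANDOM.nextDouble() < level;
  }

  public static void main(String[] args) {
    System.out.println("level 0.0 fires: " + injectCriteria(0.0)); // always false
    System.out.println("level 1.0 fires: " + injectCriteria(1.0)); // always true
  }
}
```

With a level of 1.0 the fault is guaranteed; with 0.0 it can never fire, matching the levels described above.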
+
+** Fault Injection Mechanism: AOP and AspectJ
+
+   The foundation of Hadoop's FI framework is the cross-cutting concern
+   concept, as implemented by AspectJ. The following basic terms are
+   important to remember:
+
+     * A cross-cutting concept (aspect) is behavior, and often data, that
+       is used across the scope of a piece of software
+
+     * In AOP, the aspects provide a mechanism by which a cross-cutting
+       concern can be specified in a modular way
+
+     * Advice is the code that is executed when an aspect is invoked
+
+     * Join point (or pointcut) is a specific point within the application
+       that may or may not invoke some advice
+
+** Existing Join Points
+
+   The following readily available join points are provided by AspectJ:
+
+     * Join when a method is called
+
+     * Join during a method's execution
+
+     * Join when a constructor is invoked
+
+     * Join during a constructor's execution
+
+     * Join during aspect advice execution
+
+     * Join before an object is initialized
+
+     * Join during object initialization
+
+     * Join during static initializer execution
+
+     * Join when a class's field is referenced
+
+     * Join when a class's field is assigned
+
+     * Join when a handler is executed
+
+* Aspect Example
+
+----
+    package org.apache.hadoop.hdfs.server.datanode;
+
+    import org.apache.commons.logging.Log;
+    import org.apache.commons.logging.LogFactory;
+    import org.apache.hadoop.fi.ProbabilityModel;
+    import org.apache.hadoop.hdfs.server.datanode.DataNode;
+    import org.apache.hadoop.util.DiskChecker.*;
+
+    import java.io.IOException;
+    import java.io.OutputStream;
+    import java.io.DataOutputStream;
+
+    /**
+     * This aspect takes care of faults injected into the
+     * datanode.BlockReceiver class.
+     */
+    public aspect BlockReceiverAspects {
+      public static final Log LOG = LogFactory.getLog(BlockReceiverAspects.class);
+
+      public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";
+
+      pointcut callReceivePacket() : call (* OutputStream.write(..))
+        && withincode (* BlockReceiver.receivePacket(..))
+        // to further limit the application of this aspect, a very narrow
+        // 'target' can be used as follows:
+        // && target(DataOutputStream)
+        && !within(BlockReceiverAspects+);
+
+      before () throws IOException : callReceivePacket() {
+        if (ProbabilityModel.injectCriteria(BLOCK_RECEIVER_FAULT)) {
+          LOG.info("Before the injection point");
+          Thread.dumpStack();
+          throw new DiskOutOfSpaceException("FI: injected fault point at " +
+              thisJoinPoint.getStaticPart().getSourceLocation());
+        }
+      }
+    }
+----
+
+   The aspect has two main parts:
+
+     * The pointcut callReceivePacket(), which serves as an
+       identification mark of a specific point (in the control and/or
+       data flow) in the life of an application.
+
+     * The advice - before () throws IOException :
+       callReceivePacket() - which will be injected (see Putting It All
+       Together) before that specific spot of the application's code.
+
+   The pointcut identifies an invocation of the java.io.OutputStream
+   write() method with any number of parameters and any return type.
+   This invocation should take place within the body of the method
+   receivePacket() of class BlockReceiver; that method, too, can have
+   any parameters and any return type. Any invocations of the write()
+   method happening anywhere within the aspect BlockReceiverAspects or
+   its subclasses will be ignored.
+
+   Note 1: This short example doesn't illustrate the fact that you can
+   have more than a single injection point per class. In such a case the
+   names of the faults have to be different if a developer wants to
+   trigger them separately.
+
+   Note 2: After the injection step (see Putting It All Together) you can
+   verify that the faults were properly injected by searching for ajc
+   keywords in a disassembled class file.
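For instance, one way to perform such a check (assuming a JDK's javap on the path and a woven class file in the current directory) is:

```shell
# disassemble the woven class and look for AspectJ-generated (ajc$) members
% javap -p BlockReceiver | grep ajc
```

Any ajc-prefixed synthetic members in the output indicate that the advice was woven in.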
+
+* Fault Naming Convention and Namespaces
+
+   For the sake of a unified naming convention, the following two types
+   of names are recommended when developing new aspects:
+
+     * Activity specific notation (when we don't care about a particular
+       location of a fault's happening). In this case the name of the
+       fault is rather abstract: fi.hdfs.DiskError
+
+     * Location specific notation. Here, the fault's name is mnemonic as
+       in: fi.hdfs.datanode.BlockReceiver[optional location details]
+
+* Development Tools
+
+     * The Eclipse AspectJ Development Toolkit may help you when
+       developing aspects
+
+     * IntelliJ IDEA provides AspectJ weaver and Spring-AOP plugins
+
+* Putting It All Together
+
+   Faults (aspects) have to be injected (or woven) into the code before
+   they can be used. Follow these instructions:
+     * To weave aspects in place use:
+
+----
+    % ant injectfaults
+----
+
+     * If you misidentified the join point of your aspect, you will see
+       a warning (similar to the one shown here) when the 'injectfaults'
+       target completes:
+
+----
+    [iajc] warning at
+    src/test/aop/org/apache/hadoop/hdfs/server/datanode/ \
+              BlockReceiverAspects.aj:44::0
+    advice defined in org.apache.hadoop.hdfs.server.datanode.BlockReceiverAspects
+    has not been applied [Xlint:adviceDidNotMatch]
+----
+
+     * This is not an error, so the build will still report a successful
+       result. To prepare a dev.jar file with all your faults woven in
+       place (HDFS-475 pending) use:
+
+----
+    % ant jar-fault-inject
+----
+
+     * To create test jars use:
+
+----
+    % ant jar-test-fault-inject
+----
+
+     * To run HDFS tests with faults injected use:
+
+----
+    % ant run-test-hdfs-fault-inject
+----
+
+** How to Use the Fault Injection Framework
+
+   Faults can be triggered as follows:
+
+     * During runtime:
+
+----
+    % ant run-test-hdfs -Dfi.hdfs.datanode.BlockReceiver=0.12
+----
+
+       To set a certain level, for example 25%, of all injected faults
+       use:
+
+----
+    % ant run-test-hdfs-fault-inject -Dfi.*=0.25
+----
+
+     * From a program:
+
+----
+    package org.apache.hadoop.fs;
+
+    import org.junit.After;
+    import org.junit.Before;
+    import org.junit.Test;
+
+    public class DemoFiTest {
+      public static final String BLOCK_RECEIVER_FAULT = "hdfs.datanode.BlockReceiver";
+
+      @Before
+      public void setUp() {
+        // Set up the test's environment as required
+      }
+
+      @Test
+      public void testFI() {
+        // Trigger the fault, assuming that there's one called 'hdfs.datanode.BlockReceiver'
+        System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.12");
+        //
+        // The main logic of your tests goes here
+        //
+        // Now set the level back to 0 (zero) to prevent this fault from happening again
+        System.setProperty("fi." + BLOCK_RECEIVER_FAULT, "0.0");
+        // or delete its trigger completely
+        System.getProperties().remove("fi." + BLOCK_RECEIVER_FAULT);
+      }
+
+      @After
+      public void tearDown() {
+        // Clean up the test environment
+      }
+    }
+----
+
+   As you can see above, these two approaches do the same thing: they
+   set the probability level of <<<hdfs.datanode.BlockReceiver>>> to 12%.
+   The difference, however, is that the programmatic approach provides
+   more flexibility and allows you to turn a fault off when a test no
+   longer needs it.
+
+* Additional Information and Contacts
+
+   These two sources of information are particularly interesting and worth
+   reading:
+
+     * {{http://www.eclipse.org/aspectj/doc/next/devguide/}}
+
+     * AspectJ Cookbook (ISBN-13: 978-0-596-00654-9)
+
+   If you have additional comments or questions for the author check
+   {{{https://issues.apache.org/jira/browse/HDFS-435}HDFS-435}}.

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsEditsViewer.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,106 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+
+  ---
+  Offline Edits Viewer Guide
+  ---
+  Erik Steffl
+  ---
+  ${maven.build.timestamp}
+
+Offline Edits Viewer Guide
+
+  \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+   The Offline Edits Viewer is a tool to parse the edits log file. The
+   current processors are mostly useful for conversion between different
+   formats, including XML, which is human readable and easier to edit
+   than the native binary format.
+
+   The tool can parse edits formats -18 (roughly Hadoop 0.19) and later.
+   The tool operates on files only; it does not need a Hadoop cluster to
+   be running.
+
+   Input formats supported:
+
+     [[1]] <<binary>>: native binary format that Hadoop uses internally
+
+     [[2]] <<xml>>: XML format, as produced by xml processor, used if filename
+     has <<<.xml>>> (case insensitive) extension
+
+   The Offline Edits Viewer provides several output processors (unless
+   stated otherwise, the output of a processor can be converted back to
+   the original edits file):
+
+     [[1]] <<binary>>: native binary format that Hadoop uses internally
+
+     [[2]] <<xml>>: XML format
+
+     [[3]] <<stats>>: prints out statistics, this cannot be converted back to
+     Edits file
+
+* Usage
+
+----
+   bash$ bin/hdfs oev -i edits -o edits.xml
+----
+
+*-----------------------:-----------------------------------+
+| <<Flag>>              | <<Description>>                   |
+*-----------------------:-----------------------------------+
+| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input edits log file to
+|                       | process. An <<<.xml>>> (case insensitive) extension means XML format;
+|                       | otherwise binary format is assumed. Required.
+*-----------------------:-----------------------------------+
+| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
+|                       | specified output processor generates one. If the specified file already
+|                       | exists, it is silently overwritten. Required.
+*-----------------------:-----------------------------------+
+| <<<-p>>>\|<<<--processor>>> <processor> | Specify the processor to apply
+|                       | against the edits file. Currently valid options are
+|                       | <<<binary>>>, <<<xml>>> (default) and <<<stats>>>.
+*-----------------------:-----------------------------------+
+| <<<-v>>>\|<<<--verbose>>> | Print the input and output filenames and pipe output of
+|                       | processor to console as well as specified file. On extremely large
+|                       | files, this may increase processing time by an order of magnitude.
+*-----------------------:-----------------------------------+
+| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
+*-----------------------:-----------------------------------+
+
+* Case study: Hadoop cluster recovery
+
+   If there is a problem with a Hadoop cluster and the edits file is
+   corrupted, it is possible to save at least the part of the edits file
+   that is correct. This can be done by converting the binary edits to
+   XML, editing it manually, and then converting it back to binary. The
+   most common problem is that the edits file is missing the closing
+   record (the record that has opCode -1). This should be recognized by
+   the tool, and the XML format should be properly closed.
+
+   If there is no closing record in the XML file, you can add one after
+   the last correct record. Anything after the record with opCode -1 is
+   ignored.
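The recovery workflow described above can be sketched with the oev invocations from the Usage section; the file names here are illustrative:

```shell
# convert the corrupted binary edits log to editable XML
% bin/hdfs oev -i edits -o edits.xml
# ... hand-edit edits.xml: drop any broken trailing records and make sure
#     a closing record with opCode -1 is present ...
# convert the repaired XML back to the native binary format
% bin/hdfs oev -i edits.xml -o edits.repaired -p binary
```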
+
+   Example of a closing record (with opCode -1):
+
++----
+  <RECORD>
+    <OPCODE>-1</OPCODE>
+    <DATA>
+    </DATA>
+  </RECORD>
++----

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsImageViewer.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,418 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  Offline Image Viewer Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+Offline Image Viewer Guide
+
+  \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+   The Offline Image Viewer is a tool to dump the contents of hdfs
+   fsimage files to human-readable formats in order to allow offline
+   analysis and examination of a Hadoop cluster's namespace. The tool is
+   able to process very large image files relatively quickly, converting
+   them to one of several output formats. The tool handles the layout
+   formats that were included with Hadoop versions 16 and up. If the
+   tool is not able to process an image file, it will exit cleanly. The
+   Offline Image Viewer does not require a Hadoop cluster to be running;
+   it is entirely offline in its operation.
+
+   The Offline Image Viewer provides several output processors:
+
+   [[1]] Ls is the default output processor. It closely mimics the format of
+      the lsr command. It includes the same fields, in the same order, as
+      lsr: directory or file flag, permissions, replication, owner,
+      group, file size, modification date, and full path. Unlike the lsr
+      command, the root path is included. One important difference
+      between the output of the lsr command and this processor is that
+      this output is not sorted by directory name and contents. Rather,
+      the files are listed in the order in which they are stored in the
+      fsimage file. Therefore, it is not possible to directly compare the
+      output of the lsr command and this tool. The Ls processor uses
+      information contained within the Inode blocks to calculate file
+      sizes and ignores the -skipBlocks option.
+
+   [[2]] Indented provides a more complete view of the fsimage's contents,
+      including all of the information included in the image, such as
+      image version, generation stamp and inode- and block-specific
+      listings. This processor uses indentation to organize the output
+      in a hierarchical manner. This format is suitable for easy human
+      comprehension.
+
+   [[3]] Delimited provides one file per line consisting of the path,
+      replication, modification time, access time, block size, number of
+      blocks, file size, namespace quota, diskspace quota, permissions,
+      username and group name. If run against an fsimage that does not
+      contain any of these fields, the field's column will be included,
+      but no data recorded. The default record delimiter is a tab, but
+      this may be changed via the -delimiter command line argument. This
+      processor is designed to create output that is easily analyzed by
+      other tools, such as Apache Pig. See the Analyzing Results
+      section for further information on using this processor to analyze
+      the contents of fsimage files.
+
+   [[4]] XML creates an XML document of the fsimage and includes all of the
+      information within the fsimage, similar to the lsr processor. The
+      output of this processor is amenable to automated processing and
+      analysis with XML tools. Due to the verbosity of the XML syntax,
+      this processor will also generate the largest amount of output.
+
+   [[5]] FileDistribution is the tool for analyzing file sizes in the
+      namespace image. In order to run the tool one should define a range
+      of integers [0, maxSize] by specifying maxSize and a step. The
+      range of integers is divided into segments of size step: [0, s[1],
+      ..., s[n-1], maxSize], and the processor calculates how many files
+      in the system fall into each segment [s[i-1], s[i]). Note that
+      files larger than maxSize always fall into the very last segment.
+      The output file is formatted as a tab-separated two-column table:
+      Size and NumFiles, where Size represents the start of the segment
+      and NumFiles is the number of files from the image whose size
+      falls into this segment.
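The segment bookkeeping described for the FileDistribution processor can be sketched as follows; the class and method names are illustrative, not the tool's actual code:

```java
// Sketch of FileDistribution's bucketing: sizes are grouped into segments
// of width 'step' over [0, maxSize], and any file larger than maxSize
// falls into the very last segment.
public class FileDistributionSketch {
  public static int segmentOf(long fileSize, long maxSize, long step) {
    long lastSegment = maxSize / step;
    // a file's segment is the start boundary its size falls after,
    // clamped to the final segment for files larger than maxSize
    return (int) Math.min(fileSize / step, lastSegment);
  }

  public static void main(String[] args) {
    // with maxSize = 5000 and step = 1000 there are segments 0..5
    System.out.println(segmentOf(999, 5000, 1000));    // segment 0
    System.out.println(segmentOf(1000, 5000, 1000));   // segment 1
    System.out.println(segmentOf(999999, 5000, 1000)); // segment 5 (> maxSize)
  }
}
```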
+
+* Usage
+
+** Basic
+
+   The simplest usage of the Offline Image Viewer is to provide just an
+   input and output file, via the -i and -o command-line switches:
+
+----
+   bash$ bin/hdfs oiv -i fsimage -o fsimage.txt
+----
+
+   This will create a file named fsimage.txt in the current directory
+   using the Ls output processor. For very large image files, this process
+   may take several minutes.
+
+   One can specify which output processor to use via the command-line
+   switch -p. For instance:
+
+----
+   bash$ bin/hdfs oiv -i fsimage -o fsimage.xml -p XML
+----
+
+   or
+
+----
+   bash$ bin/hdfs oiv -i fsimage -o fsimage.txt -p Indented
+----
+
+   This will run the tool using either the XML or Indented output
+   processor, respectively.
+
+   One command-line option worth considering is -skipBlocks, which
+   prevents the tool from explicitly enumerating all of the blocks that
+   make up a file in the namespace. This is useful for file systems that
+   have very large files. Enabling this option can significantly decrease
+   the size of the resulting output, as individual blocks are not
+   included. Note, however, that the Ls processor needs to enumerate the
+   blocks and so overrides this option.
+
+** Example
+
+   Consider the following contrived namespace:
+
+----
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:17 /anotherDir
+   -rw-r--r--   3 theuser supergroup  286631664 2009-03-16 21:15 /anotherDir/biggerfile
+   -rw-r--r--   3 theuser supergroup       8754 2009-03-16 21:17 /anotherDir/smallFile
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem
+   drwx-wx-wx   - theuser supergroup          0 2009-03-16 21:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:12 /one/two
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:16 /user
+   drwxr-xr-x   - theuser supergroup          0 2009-03-16 21:19 /user/theuser
+----
+
+   Applying the Offline Image Processor against this file with default
+   options would result in the following output:
+
+----
+   machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -o fsimage.txt
+
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:16 /
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:17 /anotherDir
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:12 /one
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:16 /user
+   -rw-r--r--  3   theuser supergroup    286631664 2009-03-16 14:15 /anotherDir/biggerfile
+   -rw-r--r--  3   theuser supergroup         8754 2009-03-16 14:17 /anotherDir/smallFile
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem
+   drwx-wx-wx  -   theuser supergroup            0 2009-03-16 14:11 /mapredsystem/theuser/mapredsystem/ip.redacted.com
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:12 /one/two
+   drwxr-xr-x  -   theuser supergroup            0 2009-03-16 14:19 /user/theuser
+----
+
+   Similarly, applying the Indented processor would generate output that
+   begins with:
+
+----
+   machine:hadoop-0.21.0-dev theuser$ bin/hdfs oiv -i fsimagedemo -p Indented -o fsimage.txt
+
+   FSImage
+     ImageVersion = -19
+     NamespaceID = 2109123098
+     GenerationStamp = 1003
+     INodes [NumInodes = 12]
+       Inode
+         INodePath =
+         Replication = 0
+         ModificationTime = 2009-03-16 14:16
+         AccessTime = 1969-12-31 16:00
+         BlockSize = 0
+         Blocks [NumBlocks = -1]
+         NSQuota = 2147483647
+         DSQuota = -1
+         Permissions
+           Username = theuser
+           GroupName = supergroup
+           PermString = rwxr-xr-x
+   ...remaining output omitted...
+----
+
+* Options
+
+*-----------------------:-----------------------------------+
+| <<Flag>>              | <<Description>>                   |
+*-----------------------:-----------------------------------+
+| <<<-i>>>\|<<<--inputFile>>> <input file> | Specify the input fsimage file to
+|                       | process. Required.
+*-----------------------:-----------------------------------+
+| <<<-o>>>\|<<<--outputFile>>> <output file> | Specify the output filename, if the
+|                       | specified output processor generates one. If the specified file already
+|                       | exists, it is silently overwritten. Required.
+*-----------------------:-----------------------------------+
+| <<<-p>>>\|<<<--processor>>> <processor> | Specify the image processor to apply
+|                       | against the image file. Currently valid options are Ls (default), XML
+|                       | and Indented.
+*-----------------------:-----------------------------------+
+| <<<-skipBlocks>>>     | Do not enumerate individual blocks within files. This may
+|                       | save processing time and output file space on namespaces with very
+|                       | large files. The Ls processor reads the blocks to correctly determine
+|                       | file sizes and ignores this option.
+*-----------------------:-----------------------------------+
+| <<<-printToScreen>>>  | Pipe output of processor to console as well as specified
+|                       | file. On extremely large namespaces, this may increase processing time
+|                       | by an order of magnitude.
+*-----------------------:-----------------------------------+
+| <<<-delimiter>>> <arg>| When used in conjunction with the Delimited processor,
+|                       | replaces the default tab delimiter with the string specified by arg.
+*-----------------------:-----------------------------------+
+| <<<-h>>>\|<<<--help>>>| Display the tool usage and help information and exit.
+*-----------------------:-----------------------------------+
+
+* Analyzing Results
+
+   The Offline Image Viewer makes it easy to gather large amounts of
+   data about the hdfs namespace. This information can then be used to
+   explore file system usage patterns or find specific files that match
+   arbitrary criteria, along with other types of namespace analysis. The
+   Delimited image processor in particular creates output that is
+   amenable to further processing by tools such as Apache Pig. Pig is a
+   particularly good choice for analyzing this data, as it is able to
+   deal with the output generated from a small fsimage but also scales
+   up to consume data from extremely large file systems.
+
+   The Delimited image processor generates lines of text separated, by
+   default, by tabs, and includes all of the fields that are common
+   between files that were fully constructed and files that were still
+   under construction when the fsimage was generated. Example scripts
+   are provided demonstrating how to use this output to accomplish three
+   tasks: determine the number of files each user has created on the
+   file system, find files that were created but have never been
+   accessed, and find probable duplicates of large files by comparing
+   the size of each file.
+
+   Each of the following scripts assumes you have generated an output file
+   using the Delimited processor named foo and will be storing the results
+   of the Pig analysis in a file named results.
+
+** Total Number of Files for Each User
+
+   This script processes each path within the namespace, groups them by
+   the file owner and determines the total number of files each user owns.
+
+----
+      numFilesOfEachUser.pig:
+   -- This script determines the total number of files each user has in
+   -- the namespace. Its output is of the form:
+   --   username, totalNumFiles
+
+   -- Load all of the fields from the file
+   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+                                                    replication:int,
+                                                    modTime:chararray,
+                                                    accessTime:chararray,
+                                                    blockSize:long,
+                                                    numBlocks:int,
+                                                    fileSize:long,
+                                                    NamespaceQuota:int,
+                                                    DiskspaceQuota:int,
+                                                    perms:chararray,
+                                                    username:chararray,
+                                                    groupname:chararray);
+
+
+   -- Grab just the path and username
+   B = FOREACH A GENERATE path, username;
+
+   -- Generate the sum of the number of paths for each user
+   C = FOREACH (GROUP B BY username) GENERATE group, COUNT(B.path);
+
+   -- Save results
+   STORE C INTO '$outputFile';
+----
+
+   This script can be run with Pig using the following command:
+
+----
+   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../numFilesOfEachUser.pig
+----
+
+   The output file's content will be similar to that below:
+
+----
+   bart 1
+   lisa 16
+   homer 28
+   marge 2456
+----
+
+** Files That Have Never Been Accessed
+
+   This script finds files that were created but whose access times were
+   never changed, meaning they were never opened or viewed.
+
+----
+      neverAccessed.pig:
+   -- This script generates a list of files that were created but never
+   -- accessed, based on their AccessTime
+
+   -- Load all of the fields from the file
+   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+                                                    replication:int,
+                                                    modTime:chararray,
+                                                    accessTime:chararray,
+                                                    blockSize:long,
+                                                    numBlocks:int,
+                                                    fileSize:long,
+                                                    NamespaceQuota:int,
+                                                    DiskspaceQuota:int,
+                                                    perms:chararray,
+                                                    username:chararray,
+                                                    groupname:chararray);
+
+   -- Grab just the path and last time the file was accessed
+   B = FOREACH A GENERATE path, accessTime;
+
+   -- Drop all the paths that don't have the default assigned last-access time
+   C = FILTER B BY accessTime == '1969-12-31 16:00';
+
+   -- Drop the accessTimes, since they're all the same
+   D = FOREACH C GENERATE path;
+
+   -- Save results
+   STORE D INTO '$outputFile';
+----
+
+   This script can be run with Pig using the following command, and its
+   output file's content will be a list of files that were created but
+   never viewed afterwards:
+
+----
+   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../neverAccessed.pig
+----
+
+** Probable Duplicated Files Based on File Size
+
+   This script groups files together based on their size, drops any that
+   are less than 100 MB, and returns a list of the file size, the number
+   of files found, and a tuple of the file paths. This can be used to
+   find likely duplicates within the filesystem namespace.
+
+----
+      probableDuplicates.pig:
+   -- This script finds probable duplicate files greater than 100 MB by
+   -- grouping together files based on their byte size. Files of this size
+   -- with exactly the same number of bytes can be considered probable
+   -- duplicates, but should be checked further, either by comparing the
+   -- contents directly or by another proxy, such as a hash of the contents.
+   -- The script's output is of the form:
+   --    fileSize numProbableDuplicates {(probableDup1), (probableDup2)}
+
+   -- Load all of the fields from the file
+   A = LOAD '$inputFile' USING PigStorage('\t') AS (path:chararray,
+                                                    replication:int,
+                                                    modTime:chararray,
+                                                    accessTime:chararray,
+                                                    blockSize:long,
+                                                    numBlocks:int,
+                                                    fileSize:long,
+                                                    NamespaceQuota:int,
+                                                    DiskspaceQuota:int,
+                                                    perms:chararray,
+                                                    username:chararray,
+                                                    groupname:chararray);
+
+   -- Grab the pathname and filesize
+   B = FOREACH A GENERATE path, fileSize;
+
+   -- Drop files smaller than 100 MB
+   C = FILTER B BY fileSize > 100L * 1024L * 1024L;
+
+   -- Gather all the files of the same byte size
+   D = GROUP C BY fileSize;
+
+   -- Generate path, num of duplicates, list of duplicates
+   E = FOREACH D GENERATE group AS fileSize, COUNT(C) AS numDupes, C.path AS files;
+
+   -- Drop all the files where there is only one of them
+   F = FILTER E BY numDupes > 1L;
+
+   -- Sort by the size of the files
+   G = ORDER F BY fileSize;
+
+   -- Save results
+   STORE G INTO '$outputFile';
+----
+
+   This script can be run with Pig using the following command:
+
+----
+   bin/pig -x local -param inputFile=../foo -param outputFile=../results ../probableDuplicates.pig
+----
+
+   The output file's content will be similar to that below:
+
+----
+   1077288632 2 {(/user/tennant/work1/part-00501),(/user/tennant/work1/part-00993)}
+   1077288664 4 {(/user/tennant/work0/part-00567),(/user/tennant/work0/part-03980),(/user/tennant/work1/part-00725),(/user/eccelston/output/part-03395)}
+   1077288668 3 {(/user/tennant/work0/part-03705),(/user/tennant/work0/part-04242),(/user/tennant/work1/part-03839)}
+   1077288698 2 {(/user/tennant/work0/part-00435),(/user/eccelston/output/part-01382)}
+   1077288702 2 {(/user/tennant/work0/part-03864),(/user/eccelston/output/part-03234)}
+----
+
+   Each line includes the file size in bytes that was found to be
+   duplicated, the number of duplicates found, and a list of the
+   duplicated paths. Files smaller than 100 MB are ignored; at these
+   sizes, files with exactly the same byte count are reasonably likely
+   to be duplicates.

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsPermissionsGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,257 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  HDFS Permissions Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+HDFS Permissions Guide
+
+  \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+   The Hadoop Distributed File System (HDFS) implements a permissions
+   model for files and directories that shares much of the POSIX model.
+   Each file and directory is associated with an owner and a group. The
+   file or directory has separate permissions for the user that is the
+   owner, for other users that are members of the group, and for all other
+   users. For files, the r permission is required to read the file, and
+   the w permission is required to write or append to the file. For
+   directories, the r permission is required to list the contents of the
+   directory, the w permission is required to create or delete files or
+   directories, and the x permission is required to access a child of the
+   directory.
+
+   In contrast to the POSIX model, there are no setuid or setgid bits
+   for files, as there is no notion of executable files. For
+   directories, there are no setuid or setgid bits either, as a
+   simplification. The sticky bit can be set on directories, preventing
+   anyone except the superuser, directory owner or file owner from
+   deleting or moving the files within the directory. Setting the sticky
+   bit on a file has no effect.
+   Collectively, the permissions of a file or directory are its mode. In
+   general, Unix customs for representing and displaying modes will be
+   used, including the use of octal numbers in this description. When a
+   file or directory is created, its owner is the user identity of the
+   client process, and its group is the group of the parent directory (the
+   BSD rule).
+
+   Each client process that accesses HDFS has a two-part identity
+   composed of the user name and groups list. Whenever HDFS must do a
+   permissions check for a file or directory foo accessed by a client
+   process,
+
+     * If the user name matches the owner of foo, then the owner
+       permissions are tested;
+     * Else if the group of foo matches any member of the groups list,
+       then the group permissions are tested;
+     * Otherwise the other permissions of foo are tested.
+
+   If a permissions check fails, the client operation fails.
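The three-way check above can be sketched in a few lines of Python. This is a simplified illustration only, not the NameNode's actual implementation; the function name and signature are hypothetical.

```python
# Simplified sketch of the HDFS permission check described above.
# Illustrative only; not the actual NameNode code.

def check_access(mode, owner, group, user, user_groups, requested):
    """mode: 9-bit permission int (e.g. 0o640); requested: 'r', 'w' or 'x'."""
    bit = {'r': 4, 'w': 2, 'x': 1}[requested]
    if user == owner:              # owner permissions are tested
        cls = (mode >> 6) & 7
    elif group in user_groups:     # group permissions are tested
        cls = (mode >> 3) & 7
    else:                          # other permissions are tested
        cls = mode & 7
    return bool(cls & bit)
```

For example, with mode 0640 and owner alice:staff, alice may write, a member of staff may read but not write, and everyone else is denied.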
+
+* User Identity
+
+   As of Hadoop 0.22, Hadoop supports two different modes of operation to
+   determine the user's identity, specified by the
+   hadoop.security.authentication property:
+
+   * <<simple>>
+
+          In this mode of operation, the identity of a client process is
+          determined by the host operating system. On Unix-like systems,
+          the user name is the equivalent of `whoami`.
+
+   * <<kerberos>>
+
+          In Kerberized operation, the identity of a client process is
+          determined by its Kerberos credentials. For example, in a
+          Kerberized environment, a user may use the kinit utility to
+          obtain a Kerberos ticket-granting-ticket (TGT) and use klist to
+          determine their current principal. When mapping a Kerberos
+          principal to an HDFS username, all components except for the
+          primary are dropped. For example, a principal
+          todd/foobar@CORP.COMPANY.COM will act as the simple username
+          todd on HDFS.
+
+   Regardless of the mode of operation, the user identity mechanism is
+   extrinsic to HDFS itself. There is no provision within HDFS for
+   creating user identities, establishing groups, or processing user
+   credentials.
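The default principal-to-username mapping described for the kerberos mode can be sketched as follows. This is a simplified illustration of the default behavior only; the function name is hypothetical and real deployments can customize the mapping with rules.

```python
# Simplified sketch of the default Kerberos principal-to-username
# mapping described above: all components except the primary are dropped.
# Illustrative only; Hadoop's actual mapping is rule-driven.

def principal_to_username(principal):
    """'todd/foobar@CORP.COMPANY.COM' -> 'todd'."""
    primary = principal.split('@', 1)[0]   # strip the realm
    primary = primary.split('/', 1)[0]     # strip the instance, if present
    return primary
```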
+
+* Group Mapping
+
+   Once a username has been determined as described above, the list of
+   groups is determined by a group mapping service, configured by the
+   hadoop.security.group.mapping property. The default implementation,
+   org.apache.hadoop.security.ShellBasedUnixGroupsMapping, will shell
+   out to the Unix <<<bash -c groups>>> command to resolve a list of
+   groups for a user.
+
+   An alternate implementation, which connects directly to an LDAP server
+   to resolve the list of groups, is available via
+   org.apache.hadoop.security.LdapGroupsMapping. However, this provider
+   should only be used if the required groups reside exclusively in LDAP,
+   and are not materialized on the Unix servers. More information on
+   configuring the group mapping service is available in the Javadocs.
+
+   For HDFS, the mapping of users to groups is performed on the NameNode.
+   Thus, the host system configuration of the NameNode determines the
+   group mappings for the users.
+
+   Note that HDFS stores the user and group of a file or directory as
+   strings; there is no conversion from user and group identity numbers as
+   is conventional in Unix.
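A rough Python equivalent of what the default shell-based mapping does is shown below. This is an illustration, not the actual ShellBasedUnixGroupsMapping code, and it only works on Unix-like systems where bash and the groups command are available.

```python
import getpass
import subprocess

# Illustrative sketch of the shell-based group lookup described above:
# run the Unix groups command for a user and split the result.
def unix_groups(user):
    out = subprocess.check_output(['bash', '-c', 'groups ' + user],
                                  text=True)
    # `groups user` typically prints "user : grp1 grp2 ..."; keep the names.
    return out.split(':')[-1].split()

# Example: groups of the user running this process.
my_groups = unix_groups(getpass.getuser())
```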
+
+* Understanding the Implementation
+
+   Each file or directory operation passes the full path name to the name
+   node, and the permissions checks are applied along the path for each
+   operation. The client framework will implicitly associate the user
+   identity with the connection to the name node, reducing the need for
+   changes to the existing client API. It has always been the case that
+   when one operation on a file succeeds, the operation might fail when
+   repeated because the file, or some directory on the path, no longer
+   exists. For instance, when the client first begins reading a file, it
+   makes a first request to the name node to discover the location of the
+   first blocks of the file. A second request made to find additional
+   blocks may fail. On the other hand, deleting a file does not revoke
+   access by a client that already knows the blocks of the file. With the
+   addition of permissions, a client's access to a file may be withdrawn
+   between requests. Again, changing permissions does not revoke the
+   access of a client that already knows the file's blocks.
+
+* Changes to the File System API
+
+   All methods that use a path parameter will throw <<<AccessControlException>>>
+   if permission checking fails.
+
+   New methods:
+
+     * <<<public FSDataOutputStream create(Path f, FsPermission permission,
+       boolean overwrite, int bufferSize, short replication, long
+       blockSize, Progressable progress) throws IOException;>>>
+
+     * <<<public boolean mkdirs(Path f, FsPermission permission) throws
+       IOException;>>>
+
+     * <<<public void setPermission(Path p, FsPermission permission) throws
+       IOException;>>>
+
+     * <<<public void setOwner(Path p, String username, String groupname)
+       throws IOException;>>>
+
+     * <<<public FileStatus getFileStatus(Path f) throws IOException;>>>
+     
+       will additionally return the user, group and mode associated with the
+       path.
+
+   The mode of a new file or directory is restricted by the umask set as
+   a configuration parameter. When the existing <<<create(path, …)>>> method
+   (without the permission parameter) is used, the mode of the new file is
+   <<<0666 & ^umask>>>. When the new <<<create(path, permission, …)>>> method
+   (with the permission parameter P) is used, the mode of the new file is
+   <<<P & ^umask & 0666>>>. When a new directory is created with the existing
+   <<<mkdirs(path)>>>
+   method (without the permission parameter), the mode of the new
+   directory is <<<0777 & ^umask>>>. When the new <<<mkdirs(path, permission)>>>
+   method (with the permission parameter P) is used, the mode of new
+   directory is <<<P & ^umask & 0777>>>.
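The mode arithmetic above reduces to simple bitwise expressions. A quick sketch using the default umask of 022 follows; the function names are illustrative, not part of the API.

```python
# Mode arithmetic described above, with the default umask 022.

UMASK = 0o022

def file_mode(permission=None):
    """New file mode: 0666 & ^umask, further masked by P when given."""
    mode = 0o666 & ~UMASK
    if permission is not None:
        mode &= permission
    return mode

def dir_mode(permission=None):
    """New directory mode: 0777 & ^umask, further masked by P when given."""
    mode = 0o777 & ~UMASK
    if permission is not None:
        mode &= permission
    return mode
```

With umask 022, a file created without a permission parameter gets mode 0644 and a directory gets 0755.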
+
+* Changes to the Application Shell
+
+   New operations:
+
+     * <<<chmod [-R] mode file …>>>
+
+       Only the owner of a file or the super-user is permitted to change
+       the mode of a file.
+
+     * <<<chgrp [-R] group file …>>>
+
+       The user invoking chgrp must belong to the specified group and be
+       the owner of the file, or be the super-user.
+
+     * <<<chown [-R] [owner][:[group]] file …>>>
+
+       The owner of a file may only be altered by a super-user.
+
+     * <<<ls file …>>>
+
+     * <<<lsr file …>>>
+
+       The output is reformatted to display the owner, group and mode.
+
+* The Super-User
+
+   The super-user is the user with the same identity as the name node
+   process itself. Loosely, if you started the name node, then you are the
+   super-user. The super-user can do anything in that permissions checks
+   never fail for the super-user. There is no persistent notion of who was
+   the super-user; when the name node is started the process identity
+   determines who is the super-user for now. The HDFS super-user does not
+   have to be the super-user of the name node host, nor is it necessary
+   that all clusters have the same super-user. Also, an experimenter
+   running HDFS on a personal workstation conveniently becomes that
+   installation's super-user without any configuration.
+
+   In addition, the administrator may identify a distinguished group using
+   a configuration parameter. If set, members of this group are also
+   super-users.
+
+* The Web Server
+
+   By default, the identity of the web server is a configuration
+   parameter. That is, the name node has no notion of the identity of the
+   real user, but the web server behaves as if it has the identity (user
+   and groups) of a user chosen by the administrator. Unless the chosen
+   identity matches the super-user, parts of the name space may be
+   inaccessible to the web server.
+
+* Configuration Parameters
+
+     * <<<dfs.permissions = true>>>
+
+       If true, use the permissions system as described here; if false,
+       permission checking is turned off, but all other behavior is
+       unchanged. Switching from one parameter value to the other does not
+       change the mode, owner or group of files or directories.
+       Regardless of whether permissions are on or off, chmod, chgrp and
+       chown always check permissions. These functions are only useful in
+       the permissions context, and so there is no backwards compatibility
+       issue. Furthermore, this allows administrators to reliably set
+       owners and permissions in advance of turning on regular permissions
+       checking.
+
+     * <<<dfs.web.ugi = webuser,webgroup>>>
+
+       The user name to be used by the web server. Setting this to the
+       name of the super-user allows any web client to see everything.
+       Changing this to an otherwise unused identity allows web clients to
+       see only those things visible using "other" permissions. Additional
+       groups may be added to the comma-separated list.
+
+     * <<<dfs.permissions.superusergroup = supergroup>>>
+
+       The name of the group of super-users.
+
+     * <<<fs.permissions.umask-mode = 0022>>>
+
+       The umask used when creating files and directories. For
+       configuration files, the decimal value 18 (octal 0022) may be used.
+
+     * <<<dfs.cluster.administrators = ACL-for-admins>>>
+
+       The administrators for the cluster specified as an ACL. This
+       controls who can access the default servlets, etc. in the HDFS.

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsQuotaAdminGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,118 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  HDFS Quotas Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+HDFS Quotas Guide
+
+  \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Overview
+
+   The Hadoop Distributed File System (HDFS) allows the administrator to
+   set quotas for the number of names used and the amount of space used
+   for individual directories. Name quotas and space quotas operate
+   independently, but the administration and implementation of the two
+   types of quotas are closely parallel.
+
+* Name Quotas
+
+   The name quota is a hard limit on the number of file and directory
+   names in the tree rooted at that directory. File and directory
+   creations fail if the quota would be exceeded. Quotas stick with
+   renamed directories; the rename operation fails if the operation
+   would result in a quota violation. The attempt to set a quota will
+   still succeed even if the directory would be in violation of the new
+   quota. A newly created directory has no associated quota. The largest
+   quota is <<<Long.Max_Value>>>. A quota of one forces a directory to
+   remain empty. (Yes, a directory counts against its own quota!)
+
+   Quotas are persistent with the fsimage. When starting, if the fsimage
+   is immediately in violation of a quota (perhaps the fsimage was
+   surreptitiously modified), a warning is printed for each such
+   violation. Setting or removing a quota creates a journal entry.
+
+* Space Quotas
+
+   The space quota is a hard limit on the number of bytes used by files in
+   the tree rooted at that directory. Block allocations fail if the quota
+   would not allow a full block to be written. Each replica of a block
+   counts against the quota. Quotas stick with renamed directories; the
+   rename operation fails if the operation would result in a quota
+   violation. A newly created directory has no associated quota. The
+   largest quota is <<<Long.Max_Value>>>. A quota of zero still permits files
+   to be created, but no blocks can be added to the files. Directories don't
+   use host file system space and don't count against the space quota. The
+   host file system space used to save the file meta data is not counted
+   against the quota. Quotas are charged at the intended replication
+   factor for the file; changing the replication factor for a file will
+   credit or debit quotas.
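The replication accounting described above is simple multiplication; a quick sketch with illustrative names:

```python
# Space quota is charged at the intended replication factor, so one GB
# of data stored with replication 3 consumes 3 GB of quota.

GB = 1024 ** 3

def quota_charge(file_size_bytes, replication):
    return file_size_bytes * replication

# Changing the replication factor credits or debits quota accordingly:
# dropping a 1 GB file from replication 3 to 2 frees 1 GB of quota.
```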
+
+   Quotas are persistent with the fsimage. When starting, if the fsimage
+   is immediately in violation of a quota (perhaps the fsimage was
+   surreptitiously modified), a warning is printed for each such
+   violation. Setting or removing a quota creates a journal entry.
+
+* Administrative Commands
+
+   Quotas are managed by a set of commands available only to the
+   administrator.
+
+     * <<<dfsadmin -setQuota <N> <directory>...<directory> >>>
+
+       Set the name quota to be N for each directory. Best effort for each
+       directory, with faults reported if N is not a positive long
+       integer, the directory does not exist or it is a file, or the
+       directory would immediately exceed the new quota.
+
+     * <<<dfsadmin -clrQuota <directory>...<directory> >>>
+
+       Remove any name quota for each directory. Best effort for each
+       directory, with faults reported if the directory does not exist or
+       it is a file. It is not a fault if the directory has no quota.
+
+     * <<<dfsadmin -setSpaceQuota <N> <directory>...<directory> >>>
+
+       Set the space quota to be N bytes for each directory. This is a
+       hard limit on total size of all the files under the directory tree.
+       The space quota also takes replication into account, i.e. one GB
+       of data with replication of 3 consumes 3 GB of quota. N can also
+       be specified with a binary prefix for convenience, e.g. 50g for
+       50 gigabytes and 2t for 2 terabytes. Best effort for each
+       directory, with faults reported if N is neither zero nor a positive
+       integer, the directory does not exist or it is a file, or the
+       directory would immediately exceed the new quota.
+
+     * <<<dfsadmin -clrSpaceQuota <directory>...<directory> >>>
+
+       Remove any space quota for each directory. Best effort for each
+       directory, with faults reported if the directory does not exist or
+       it is a file. It is not a fault if the directory has no quota.
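The binary-prefix convention accepted by <<<-setSpaceQuota>>> (50g, 2t, and so on) can be sketched as follows. This is illustrative only; the actual parsing is internal to dfsadmin.

```python
# Sketch of the binary-prefix convention for space quota sizes described
# above. Illustrative only; not dfsadmin's actual parser.

_PREFIXES = {'k': 1024, 'm': 1024 ** 2, 'g': 1024 ** 3, 't': 1024 ** 4}

def parse_space_quota(value):
    """'50g' -> 50 * 2**30 bytes; a bare number is taken as bytes."""
    value = value.strip().lower()
    if value and value[-1] in _PREFIXES:
        return int(value[:-1]) * _PREFIXES[value[-1]]
    return int(value)
```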
+
+* Reporting Command
+
+   An extension to the count command of the HDFS shell reports quota
+   values and the current count of names and bytes in use.
+
+     * <<<fs -count -q <directory>...<directory> >>>
+
+       With the -q option, also report the name quota value set for each
+       directory, the available name quota remaining, the space quota
+       value set, and the available space quota remaining. If the
+       directory does not have a quota set, the reported values are <<<none>>>
+       and <<<inf>>>.

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/HdfsUserGuide.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,499 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  HDFS Users Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+HDFS Users Guide
+
+%{toc|section=1|fromDepth=0}
+
+* Purpose
+
+   This document is a starting point for users working with Hadoop
+   Distributed File System (HDFS) either as a part of a Hadoop cluster or
+   as a stand-alone general purpose distributed file system. While HDFS is
+   designed to "just work" in many environments, a working knowledge of
+   HDFS helps greatly with configuration improvements and diagnostics on a
+   specific cluster.
+
+* Overview
+
+   HDFS is the primary distributed storage used by Hadoop applications.
+   An HDFS cluster primarily consists of a NameNode that manages the
+   file system metadata and DataNodes that store the actual data. The HDFS
+   Architecture Guide describes HDFS in detail. This user guide primarily
+   deals with the interaction of users and administrators with HDFS
+   clusters. The HDFS architecture diagram depicts basic interactions
+   among NameNode, the DataNodes, and the clients. Clients contact
+   NameNode for file metadata or file modifications and perform actual
+   file I/O directly with the DataNodes.
+
+   The following are some of the salient features that could be of
+   interest to many users.
+
+     * Hadoop, including HDFS, is well suited for distributed storage and
+       distributed processing using commodity hardware. It is fault
+       tolerant, scalable, and extremely simple to expand. MapReduce,
+       well known for its simplicity and applicability to a large set of
+       distributed applications, is an integral part of Hadoop.
+
+     * HDFS is highly configurable with a default configuration well
+       suited for many installations. Most of the time, configuration
+       needs to be tuned only for very large clusters.
+
+     * Hadoop is written in Java and is supported on all major platforms.
+
+     * Hadoop supports shell-like commands to interact with HDFS directly.
+
+     * The NameNode and DataNodes have built-in web servers that make it
+       easy to check the current status of the cluster.
+
+     * New features and improvements are regularly implemented in HDFS.
+       The following is a subset of useful features in HDFS:
+
+          * File permissions and authentication.
+
+          * Rack awareness: to take a node's physical location into
+            account while scheduling tasks and allocating storage.
+
+          * Safemode: an administrative mode for maintenance.
+
+          * <<<fsck>>>: a utility to diagnose health of the file system, to find
+            missing files or blocks.
+
+          * <<<fetchdt>>>: a utility to fetch DelegationToken and store it in a
+            file on the local system.
+
+          * Rebalancer: tool to balance the cluster when the data is
+            unevenly distributed among DataNodes.
+
+          * Upgrade and rollback: after a software upgrade, it is
+            possible to roll back to HDFS's state before the upgrade in
+            case of unexpected problems.
+
+          * Secondary NameNode: performs periodic checkpoints of the
+            namespace and helps keep the size of the file containing the
+            log of HDFS modifications within certain limits at the
+            NameNode.
+
+          * Checkpoint node: performs periodic checkpoints of the
+            namespace and helps minimize the size of the log stored at the
+            NameNode containing changes to the HDFS. It replaces the role
+            previously filled by the Secondary NameNode, though it is not
+            yet battle-hardened. The NameNode allows multiple Checkpoint
+            nodes simultaneously, as long as there are no Backup nodes
+            registered with the system.
+
+          * Backup node: An extension to the Checkpoint node. In addition
+            to checkpointing it also receives a stream of edits from the
+            NameNode and maintains its own in-memory copy of the
+            namespace, which is always in sync with the active NameNode
+            namespace state. Only one Backup node may be registered with
+            the NameNode at once.
+
+* Prerequisites
+
+   The following documents describe how to install and set up a Hadoop
+   cluster:
+
+     * {{Single Node Setup}} for first-time users.
+
+     * {{Cluster Setup}} for large, distributed clusters.
+
+   The rest of this document assumes the user is able to set up and run
+   HDFS with at least one DataNode. For the purpose of this document,
+   both the NameNode and DataNode may run on the same physical machine.
+
+* Web Interface
+
+   NameNode and DataNode each run an internal web server in order to
+   display basic information about the current status of the cluster. With
+   the default configuration, the NameNode front page is at
+   <<<http://namenode-name:50070/>>>. It lists the DataNodes in the cluster and
+   basic statistics of the cluster. The web interface can also be used
+   to browse the file system (using the "Browse the file system" link on
+   the NameNode front page).
+
+* Shell Commands
+
+   Hadoop includes various shell-like commands that directly interact with
+   HDFS and other file systems that Hadoop supports. The command <<<bin/hdfs dfs -help>>>
+   lists the commands supported by Hadoop shell. Furthermore,
+   the command <<<bin/hdfs dfs -help command-name>>> displays more detailed help
+   for a command. These commands support most of the normal file system
+   operations like copying files, changing file permissions, etc. They
+   also support a few HDFS-specific operations like changing the
+   replication of files. For more information see the
+   {{{File System Shell Guide}}}.
+
+**  DFSAdmin Command
+
+   The <<<bin/hadoop dfsadmin>>> command supports a few HDFS administration
+   related operations. The <<<bin/hadoop dfsadmin -help>>> command lists all the
+   commands currently supported. For example:
+
+     * <<<-report>>>: reports basic statistics of HDFS. Some of this
+       information is also available on the NameNode front page.
+
+     * <<<-safemode>>>: though usually not required, an administrator can
+       manually enter or leave Safemode.
+
+     * <<<-finalizeUpgrade>>>: removes the previous backup of the cluster
+       made during the last upgrade.
+
+     * <<<-refreshNodes>>>: Updates the namenode with the set of datanodes
+       allowed to connect to the namenode. The namenode re-reads datanode
+       hostnames from the files defined by <<<dfs.hosts>>> and <<<dfs.hosts.exclude>>>.
+       Hosts defined in <<<dfs.hosts>>> are the datanodes that are part of the
+       cluster. If there are entries in <<<dfs.hosts>>>, only the hosts in it
+       are allowed to register with the namenode. Entries in
+       <<<dfs.hosts.exclude>>> are datanodes that need to be decommissioned.
+       Datanodes complete decommissioning when all the replicas from them
+       are replicated to other datanodes. Decommissioned nodes are not
+       automatically shut down and are not chosen as targets for new
+       replicas.
+
+     * <<<-printTopology>>>: Prints the topology of the cluster: a tree
+       of racks and the datanodes attached to the racks, as viewed by the
+       NameNode.
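
The include and exclude lists used by <<<-refreshNodes>>> are configured as plain file paths in <<<hdfs-site.xml>>>; a minimal sketch, where the file locations are illustrative assumptions:

```xml
<!-- Sketch of the datanode include/exclude configuration.
     The file paths below are illustrative assumptions; each file
     lists one datanode hostname or IP address per line. -->
<property>
  <name>dfs.hosts</name>
  <value>/etc/hadoop/conf/dfs.hosts</value>
</property>
<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.hosts.exclude</value>
</property>
```

After editing the files, run <<<bin/hadoop dfsadmin -refreshNodes>>> so the namenode re-reads them.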
+
+   For command usage, see {{{dfsadmin}}}.
+
+* Secondary NameNode
+
+   The NameNode stores modifications to the file system as a log appended
+   to a native file system file, edits. When a NameNode starts up, it
+   reads HDFS state from an image file, fsimage, and then applies edits
+   from the edits log file. It then writes new HDFS state to the fsimage
+   and starts normal operation with an empty edits file. Since NameNode
+   merges fsimage and edits files only during start up, the edits log file
+   could get very large over time on a busy cluster. Another side effect
+   of a larger edits file is that the next restart of the NameNode takes
+   longer.
+
+   The secondary NameNode merges the fsimage and the edits log files
+   periodically and keeps edits log size within a limit. It is usually run
+   on a different machine than the primary NameNode since its memory
+   requirements are on the same order as the primary NameNode.
+
+   The start of the checkpoint process on the secondary NameNode is
+   controlled by two configuration parameters.
+
+     * <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
+       the maximum delay between two consecutive checkpoints, and
+
+     * <<<dfs.namenode.checkpoint.txns>>>, set to 40000 by default, defines the
+       number of uncheckpointed transactions on the NameNode which will
+       force an urgent checkpoint, even if the checkpoint period has not
+       been reached.
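
The two parameters above can be sketched as an <<<hdfs-site.xml>>> fragment; the values shown are simply the defaults just described:

```xml
<!-- Sketch: checkpoint tuning. The values are the documented defaults. -->
<property>
  <name>dfs.namenode.checkpoint.period</name>
  <value>3600</value> <!-- seconds; one hour -->
</property>
<property>
  <name>dfs.namenode.checkpoint.txns</name>
  <value>40000</value> <!-- uncheckpointed transactions forcing a checkpoint -->
</property>
```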
+
+   The secondary NameNode stores the latest checkpoint in a directory
+   which is structured the same way as the primary NameNode's directory,
+   so that the checkpointed image is always ready to be read by the
+   primary NameNode if necessary.
+
+   For command usage, see {{{secondarynamenode}}}.
+
+* Checkpoint Node
+
+   NameNode persists its namespace using two files: fsimage, which is the
+   latest checkpoint of the namespace and edits, a journal (log) of
+   changes to the namespace since the checkpoint. When a NameNode starts
+   up, it merges the fsimage and edits journal to provide an up-to-date
+   view of the file system metadata. The NameNode then overwrites fsimage
+   with the new HDFS state and begins a new edits journal.
+
+   The Checkpoint node periodically creates checkpoints of the namespace.
+   It downloads fsimage and edits from the active NameNode, merges them
+   locally, and uploads the new image back to the active NameNode. The
+   Checkpoint node usually runs on a different machine than the NameNode
+   since its memory requirements are on the same order as the NameNode.
+   The Checkpoint node is started by <<<bin/hdfs namenode -checkpoint>>> on the
+   node specified in the configuration file.
+
+   The location of the Checkpoint (or Backup) node and its accompanying
+   web interface are configured via the <<<dfs.namenode.backup.address>>> and
+   <<<dfs.namenode.backup.http-address>>> configuration variables.
+
+   The start of the checkpoint process on the Checkpoint node is
+   controlled by two configuration parameters.
+
+     * <<<dfs.namenode.checkpoint.period>>>, set to 1 hour by default, specifies
+       the maximum delay between two consecutive checkpoints
+
+     * <<<dfs.namenode.checkpoint.txns>>>, set to 40000 by default, defines the
+       number of uncheckpointed transactions on the NameNode which will
+       force an urgent checkpoint, even if the checkpoint period has not
+       been reached.
+
+   The Checkpoint node stores the latest checkpoint in a directory that is
+   structured the same as the NameNode's directory. This allows the
+   checkpointed image to be always available for reading by the NameNode
+   if necessary. See Import Checkpoint.
+
+   Multiple checkpoint nodes may be specified in the cluster configuration
+   file.
+
+   For command usage, see {{{namenode}}}.
+
+* Backup Node
+
+   The Backup node provides the same checkpointing functionality as the
+   Checkpoint node, as well as maintaining an in-memory, up-to-date copy
+   of the file system namespace that is always synchronized with the
+   active NameNode state. Along with accepting a journal stream of file
+   system edits from the NameNode and persisting this to disk, the Backup
+   node also applies those edits into its own copy of the namespace in
+   memory, thus creating a backup of the namespace.
+
+   The Backup node does not need to download fsimage and edits files from
+   the active NameNode in order to create a checkpoint, as would be
+   required with a Checkpoint node or Secondary NameNode, since it already
+   has an up-to-date state of the namespace in memory. The Backup
+   node checkpoint process is more efficient as it only needs to save the
+   namespace into the local fsimage file and reset edits.
+
+   As the Backup node maintains a copy of the namespace in memory, its RAM
+   requirements are the same as the NameNode's.
+
+   The NameNode supports one Backup node at a time. No Checkpoint nodes
+   may be registered if a Backup node is in use. Using multiple Backup
+   nodes concurrently will be supported in the future.
+
+   The Backup node is configured in the same manner as the Checkpoint
+   node. It is started with <<<bin/hdfs namenode -backup>>>.
+
+   The location of the Backup (or Checkpoint) node and its accompanying
+   web interface are configured via the <<<dfs.namenode.backup.address>>> and
+   <<<dfs.namenode.backup.http-address>>> configuration variables.
+
+   Use of a Backup node provides the option of running the NameNode with
+   no persistent storage, delegating all responsibility for persisting the
+   state of the namespace to the Backup node. To do this, start the
+   NameNode with the <<<-importCheckpoint>>> option and specify no persistent
+   storage directories of type edits (<<<dfs.namenode.edits.dir>>>) in the
+   NameNode configuration.
+
+   For a complete discussion of the motivation behind the creation of the
+   Backup node and Checkpoint node, see {{{https://issues.apache.org/jira/browse/HADOOP-4539}HADOOP-4539}}.
+   For command usage, see {{{namenode}}}.
+
+* Import Checkpoint
+
+   The latest checkpoint can be imported to the NameNode if all other
+   copies of the image and the edits files are lost. In order to do that
+   one should:
+
+     * Create an empty directory specified in the <<<dfs.namenode.name.dir>>>
+       configuration variable;
+
+     * Specify the location of the checkpoint directory in the
+       configuration variable <<<dfs.namenode.checkpoint.dir>>>;
+
+     * and start the NameNode with the <<<-importCheckpoint>>> option.
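
The first two steps above correspond to an <<<hdfs-site.xml>>> fragment; the directory paths here are illustrative assumptions:

```xml
<!-- Sketch of the directories involved in an import; both paths are
     illustrative assumptions. dfs.namenode.name.dir must be empty, and
     dfs.namenode.checkpoint.dir must hold the checkpoint to import. -->
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/var/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/var/hadoop/dfs/namesecondary</value>
</property>
```

With this configuration in place, starting the NameNode with the <<<-importCheckpoint>>> option reads the image from the checkpoint directory.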
+
+   The NameNode will upload the checkpoint from the
+   <<<dfs.namenode.checkpoint.dir>>> directory and then save it to the NameNode
+   directories set in <<<dfs.namenode.name.dir>>>. The NameNode will fail if a
+   legal image is contained in <<<dfs.namenode.name.dir>>>. The NameNode
+   verifies that the image in <<<dfs.namenode.checkpoint.dir>>> is consistent,
+   but does not modify it in any way.
+
+   For command usage, see {{{namenode}}}.
+
+* Rebalancer
+
+   HDFS data might not always be placed uniformly across the DataNodes.
+   One common reason is addition of new DataNodes to an existing cluster.
+   While placing new blocks (data for a file is stored as a series of
+   blocks), NameNode considers various parameters before choosing the
+   DataNodes to receive these blocks. Some of the considerations are:
+
+     * Policy to keep one of the replicas of a block on the same node as
+       the node that is writing the block.
+
+     * Need to spread different replicas of a block across the racks so
+       that cluster can survive loss of whole rack.
+
+     * One of the replicas is usually placed on the same rack as the node
+       writing to the file so that cross-rack network I/O is reduced.
+
+     * Spread HDFS data uniformly across the DataNodes in the cluster.
+
+   Due to multiple competing considerations, data might not be uniformly
+   placed across the DataNodes. HDFS provides a tool for administrators
+   that analyzes block placement and rebalances data across the DataNodes.
+   A brief administrator's guide for the rebalancer is attached as a PDF
+   to {{{https://issues.apache.org/jira/browse/HADOOP-1652}HADOOP-1652}}.
+
+   For command usage, see {{{balancer}}}.
+
+* Rack Awareness
+
+   Typically large Hadoop clusters are arranged in racks, and network
+   traffic between nodes within the same rack is much more desirable
+   than network traffic across racks. In addition, the NameNode tries to
+   place replicas of a block on multiple racks for improved fault
+   tolerance. Hadoop lets the cluster administrators decide which rack a
+   node belongs to through the configuration variable
+   <<<net.topology.script.file.name>>>. When this script is configured, each
+   node runs the script to determine its rack id. A default installation
+   assumes all the nodes belong to the same rack. This feature and
+   configuration is further described in the PDF attached to
+   {{{https://issues.apache.org/jira/browse/HADOOP-692}HADOOP-692}}.
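
A topology script is simply an executable that prints one rack id per node address it is given; a minimal sketch, where the subnets and rack names are illustrative assumptions:

```shell
#!/bin/bash
# Minimal topology script sketch for net.topology.script.file.name.
# The subnets and rack names below are illustrative assumptions.
# Hadoop invokes the script with one or more node IPs/hostnames as
# arguments and expects one rack id printed per argument.
rack_for() {
  case "$1" in
    10.1.1.*) echo /rack1 ;;
    10.1.2.*) echo /rack2 ;;
    *)        echo /default-rack ;;
  esac
}

for node in "$@"; do
  rack_for "$node"
done
```

Point <<<net.topology.script.file.name>>> at such a file and make it executable; addresses matching no rule fall into <<</default-rack>>>, which matches the behavior of a default installation with no script configured.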
+
+* Safemode
+
+   During start up the NameNode loads the file system state from the
+   fsimage and the edits log file. It then waits for DataNodes to report
+   their blocks so that it does not prematurely start replicating the
+   blocks even though enough replicas already exist in the cluster. During this
+   time NameNode stays in Safemode. Safemode for the NameNode is
+   essentially a read-only mode for the HDFS cluster, where it does not
+   allow any modifications to file system or blocks. Normally the NameNode
+   leaves Safemode automatically after the DataNodes have reported that
+   most file system blocks are available. If required, HDFS can be
+   placed in Safemode explicitly using the <<<bin/hadoop dfsadmin -safemode>>>
+   command. The NameNode front page shows whether Safemode is on or off. A
+   more detailed description and configuration is maintained as JavaDoc
+   for <<<setSafeMode()>>>.
+
+* fsck
+
+   HDFS supports the fsck command to check for various inconsistencies.
+   It is designed for reporting problems with various files, for example,
+   missing blocks for a file or under-replicated blocks. Unlike a
+   traditional fsck utility for native file systems, this command does not
+   correct the errors it detects. Normally NameNode automatically corrects
+   most of the recoverable failures. By default fsck ignores open files
+   but provides an option to select all files during reporting. The HDFS
+   fsck command is not a Hadoop shell command. It can be run as
+   <<<bin/hadoop fsck>>>. For command usage, see {{{fsck}}}. fsck can be run on the
+   whole file system or on a subset of files.
+
+* fetchdt
+
+   HDFS supports the fetchdt command to fetch a Delegation Token and
+   store it in a file on the local system. This token can later be used to
+   access a secure server (the NameNode, for example) from a non-secure
+   client. The utility uses either RPC or HTTPS (over Kerberos) to get the
+   token, and thus requires Kerberos tickets to be present before the run
+   (run kinit to get the tickets). The HDFS fetchdt command is not a Hadoop
+   shell command. It can be run as <<<bin/hadoop fetchdt DTfile>>>. After you
+   have obtained the token, you can run an HDFS command without Kerberos
+   tickets by pointing the <<<HADOOP_TOKEN_FILE_LOCATION>>> environment
+   variable at the delegation token file. For command usage, see
+   {{{fetchdt}}} command.
+
+* Recovery Mode
+
+   Typically, you will configure multiple metadata storage locations.
+   Then, if one storage location is corrupt, you can read the metadata
+   from one of the other storage locations.
+
+   However, what can you do if the only storage locations available are
+   corrupt? In this case, there is a special NameNode startup mode called
+   Recovery mode that may allow you to recover most of your data.
+
+   You can start the NameNode in recovery mode like so: <<<namenode -recover>>>
+
+   When in recovery mode, the NameNode will interactively prompt you at
+   the command line about possible courses of action you can take to
+   recover your data.
+
+   If you don't want to be prompted, you can give the <<<-force>>> option. This
+   option will force recovery mode to always select the first choice.
+   Normally, this will be the most reasonable choice.
+
+   Because Recovery mode can cause you to lose data, you should always
+   back up your edit log and fsimage before using it.
+
+* Upgrade and Rollback
+
+   When Hadoop is upgraded on an existing cluster, as with any software
+   upgrade, it is possible there are new bugs or incompatible changes that
+   affect existing applications and were not discovered earlier. In any
+   non-trivial HDFS installation, it is not an option to lose any data,
+   let alone to restart HDFS from scratch. HDFS allows administrators to
+   go back to the earlier version of Hadoop and roll back the cluster to the
+   state it was in before the upgrade. HDFS upgrade is described in more
+   detail in the {{{Hadoop Upgrade}}} Wiki page. HDFS can have one such backup
+   at a time. Before upgrading, administrators need to remove the existing
+   backup using the <<<bin/hadoop dfsadmin -finalizeUpgrade>>> command. The following
+   briefly describes the typical upgrade procedure:
+
+     * Before upgrading Hadoop software, finalize if there is an existing
+       backup. <<<dfsadmin -upgradeProgress status>>> can tell if the cluster
+       needs to be finalized.
+
+     * Stop the cluster and distribute new version of Hadoop.
+
+     * Run the new version with <<<-upgrade>>> option (<<<bin/start-dfs.sh -upgrade>>>).
+
+     * Most of the time, the cluster works just fine. Once the new HDFS is
+       considered to be working well (possibly after a few days of operation),
+       finalize the upgrade. Note that until the cluster is finalized,
+       deleting the files that existed before the upgrade does not free up
+       real disk space on the DataNodes.
+
+     * If there is a need to move back to the old version,
+
+          * stop the cluster and distribute the earlier version of Hadoop.
+
+          * start the cluster with the rollback option (<<<bin/start-dfs.sh -rollback>>>).
+
+* File Permissions and Security
+
+   The file permissions are designed to be similar to file permissions on
+   other familiar platforms like Linux. Currently, security is limited to
+   simple file permissions. The user that starts NameNode is treated as
+   the superuser for HDFS. Future versions of HDFS will support network
+   authentication protocols like Kerberos for user authentication and
+   encryption of data transfers. The details are discussed in the
+   Permissions Guide.
+
+* Scalability
+
+   Hadoop currently runs on clusters with thousands of nodes. The
+   {{{PoweredBy}}} Wiki page lists some of the organizations that deploy Hadoop
+   on large clusters. HDFS has one NameNode for each cluster. Currently
+   the total memory available on NameNode is the primary scalability
+   limitation. On very large clusters, increasing average size of files
+   stored in HDFS helps with increasing cluster size without increasing
+   memory requirements on the NameNode. The default configuration may not
+   suit very large clusters. The {{{FAQ}}} Wiki page lists suggested
+   configuration improvements for large Hadoop clusters.
+
+* Related Documentation
+
+   This user guide is a good starting point for working with HDFS. While
+   the user guide continues to improve, there is a wealth of
+   documentation about Hadoop and HDFS. The following list is a starting
+   point for further exploration:
+
+     * {{{Hadoop Site}}}: The home page for the Apache Hadoop site.
+
+     * {{{Hadoop Wiki}}}: The home page (FrontPage) for the Hadoop Wiki. Unlike
+       the released documentation, which is part of the Hadoop source tree,
+       the Hadoop Wiki is regularly edited by the Hadoop community.
+
+     * {{{FAQ}}}: The FAQ Wiki page.
+
+     * {{{Hadoop JavaDoc API}}}.
+
+     * {{{Hadoop User Mailing List}}}: core-user[at]hadoop.apache.org.
+
+     * Explore {{{src/hdfs/hdfs-default.xml}}}. It includes a brief description
+       of most of the configuration variables available.
+
+     * {{{Hadoop Commands Guide}}}: Hadoop commands usage.

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/Hftp.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,60 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  HFTP Guide
+  ---
+  ---
+  ${maven.build.timestamp}
+
+HFTP Guide
+
+  \[ {{{./index.html}Go Back}} \]
+
+%{toc|section=1|fromDepth=0}
+
+* Introduction
+
+   HFTP is a Hadoop filesystem implementation that lets you read data from
+   a remote Hadoop HDFS cluster. The reads are done via HTTP, and data is
+   sourced from DataNodes. HFTP is a read-only filesystem, and will throw
+   exceptions if you try to use it to write data or modify the filesystem
+   state.
+
+   HFTP is primarily useful if you have multiple HDFS clusters with
+   different versions and you need to move data from one to another. HFTP
+   is wire-compatible even between different versions of HDFS. For
+   example, you can do things like: <<<hadoop distcp -i hftp://sourceFS:50070/src hdfs://destFS:50070/dest>>>.
+   Note that HFTP is read-only so the destination must be an HDFS filesystem.
+   (Also, in this example, the distcp should be run using the configuration of
+   the new filesystem.)
+
+   An extension, HSFTP, uses HTTPS by default. This means that data will
+   be encrypted in transit.
+
+* Implementation
+
+   The code for HFTP lives in the Java class
+   <<<org.apache.hadoop.hdfs.HftpFileSystem>>>. Likewise, HSFTP is implemented
+   in <<<org.apache.hadoop.hdfs.HsftpFileSystem>>>.
+
+* Configuration Options
+
+*-----------------------:-----------------------------------+
+| <<Name>>              | <<Description>>                   |
+*-----------------------:-----------------------------------+
+| <<<dfs.hftp.https.port>>> | the HTTPS port on the remote cluster. If not set,
+|                       |   HFTP will fall back on <<<dfs.https.port>>>.
+*-----------------------:-----------------------------------+
+| <<<hdfs.service.host_ip:port>>> | Specifies the service name (for the security
+|                       |  subsystem) associated with the HFTP filesystem running at ip:port.
+*-----------------------:-----------------------------------+

Added: hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm
URL: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm?rev=1440245&view=auto
==============================================================================
--- hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm (added)
+++ hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/site/apt/LibHdfs.apt.vm Wed Jan 30 01:52:14 2013
@@ -0,0 +1,94 @@
+~~ Licensed under the Apache License, Version 2.0 (the "License");
+~~ you may not use this file except in compliance with the License.
+~~ You may obtain a copy of the License at
+~~
+~~   http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License. See accompanying LICENSE file.
+
+  ---
+  C API libhdfs
+  ---
+  ---
+  ${maven.build.timestamp}
+
+C API libhdfs
+
+%{toc|section=1|fromDepth=0} 
+
+* Overview
+
+   libhdfs is a JNI based C API for Hadoop's Distributed File System
+   (HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate
+   HDFS files and the filesystem. libhdfs is part of the Hadoop
+   distribution and comes pre-compiled in
+   <<<${HADOOP_PREFIX}/libhdfs/libhdfs.so>>> .
+
+* The APIs
+
+   The libhdfs APIs are a subset of the {{{hadoop fs APIs}}}.
+
+   The header file for libhdfs describes each API in detail and is
+   available in <<<${HADOOP_PREFIX}/src/c++/libhdfs/hdfs.h>>>
+
+* A Sample Program
+
+----
+    \#include "hdfs.h"
+    \#include <fcntl.h>
+    \#include <stdio.h>
+    \#include <stdlib.h>
+    \#include <string.h>
+
+    int main(int argc, char **argv) {
+        hdfsFS fs = hdfsConnect("default", 0);
+        const char* writePath = "/tmp/testfile.txt";
+        hdfsFile writeFile = hdfsOpenFile(fs, writePath, O_WRONLY|O_CREAT, 0, 0, 0);
+        if (!writeFile) {
+            fprintf(stderr, "Failed to open %s for writing!\n", writePath);
+            exit(-1);
+        }
+        char* buffer = "Hello, World!";
+        tSize num_written_bytes = hdfsWrite(fs, writeFile, (void*)buffer, strlen(buffer)+1);
+        if (hdfsFlush(fs, writeFile)) {
+            fprintf(stderr, "Failed to 'flush' %s\n", writePath);
+            exit(-1);
+        }
+        hdfsCloseFile(fs, writeFile);
+        hdfsDisconnect(fs);
+        return 0;
+    }
+----
+
+* How To Link With The Library
+
+   See the Makefile for <<<hdfs_test.c>>> in the libhdfs source directory
+   (<<<${HADOOP_PREFIX}/src/c++/libhdfs/Makefile>>>) or something like:
+   <<<gcc above_sample.c -I${HADOOP_PREFIX}/src/c++/libhdfs -L${HADOOP_PREFIX}/libhdfs -lhdfs -o above_sample>>>
+
+* Common Problems
+
+   The most common problem is the <<<CLASSPATH>>> is not set properly when
+   calling a program that uses libhdfs. Make sure you set it to all the
+   Hadoop jars needed to run Hadoop itself. Currently, there is no way to
+   programmatically generate the classpath, but a good bet is to include
+   all the jar files in <<<${HADOOP_PREFIX}>>> and <<<${HADOOP_PREFIX}/lib>>> as well
+   as the right configuration directory containing <<<hdfs-site.xml>>>.
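
One way to follow this advice is to build <<<CLASSPATH>>> with a small shell loop; this is a sketch, and the install layout assumed here (jars directly under the Hadoop prefix and its lib directory, configuration in a conf directory) is an assumption matching the directories named above:

```shell
#!/bin/bash
# Sketch: collect Hadoop jars and the config directory into CLASSPATH.
# The HADOOP_PREFIX default and the conf/ location are assumptions;
# adjust them for your installation.
HADOOP_PREFIX=${HADOOP_PREFIX:-/usr/local/hadoop}
CLASSPATH=$HADOOP_PREFIX/conf   # directory containing hdfs-site.xml
for jar in "$HADOOP_PREFIX"/*.jar "$HADOOP_PREFIX"/lib/*.jar; do
  [ -e "$jar" ] && CLASSPATH=$CLASSPATH:$jar
done
export CLASSPATH
```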
+
+* Thread Safe
+
+   libhdfs is thread safe.
+
+     * Concurrency and Hadoop FS "handles"
+
+       The Hadoop FS implementation includes a FS handle cache which
+       caches based on the URI of the namenode along with the user
+       connecting. So, all calls to <<<hdfsConnect>>> will return the same
+       handle but calls to <<<hdfsConnectAsUser>>> with different users will
+       return different handles. But, since HDFS client handles are
+       completely thread safe, this has no bearing on concurrency.
+
+     * Concurrency and libhdfs/JNI
+
+       The libhdfs calls to JNI should always be creating thread local
+       storage, so (in theory), libhdfs should be as thread safe as the
+       underlying calls to the Hadoop FS.


