Return-Path:
Delivered-To: apmail-hadoop-core-commits-archive@www.apache.org
Received: (qmail 27637 invoked from network); 30 Jan 2008 07:05:48 -0000
Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2)
  by minotaur.apache.org with SMTP; 30 Jan 2008 07:05:48 -0000
Received: (qmail 90494 invoked by uid 500); 30 Jan 2008 07:05:39 -0000
Delivered-To: apmail-hadoop-core-commits-archive@hadoop.apache.org
Received: (qmail 90474 invoked by uid 500); 30 Jan 2008 07:05:38 -0000
Mailing-List: contact core-commits-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: core-dev@hadoop.apache.org
Delivered-To: mailing list core-commits@hadoop.apache.org
Received: (qmail 90465 invoked by uid 99); 30 Jan 2008 07:05:38 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
  by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 29 Jan 2008 23:05:38 -0800
X-ASF-Spam-Status: No, hits=-100.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.3] (HELO eris.apache.org) (140.211.11.3)
  by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 30 Jan 2008 07:05:16 +0000
Received: by eris.apache.org (Postfix, from userid 65534) id 529C11A9832;
  Tue, 29 Jan 2008 23:05:06 -0800 (PST)
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: svn commit: r616662 - in /hadoop/core/trunk: ./ docs/ src/docs/src/documentation/content/xdocs/
Date: Wed, 30 Jan 2008 07:05:04 -0000
To: core-commits@hadoop.apache.org
From: nigel@apache.org
X-Mailer: svnmailer-1.0.8
Message-Id: <20080130070506.529C11A9832@eris.apache.org>
X-Virus-Checked: Checked by ClamAV on apache.org

Author: nigel
Date: Tue Jan 29 23:05:03 2008
New Revision: 616662

URL: http://svn.apache.org/viewvc?rev=616662&view=rev
Log:
Preparing to branch for release 0.16.0

Modified:
    hadoop/core/trunk/CHANGES.txt
    hadoop/core/trunk/build.xml
    hadoop/core/trunk/docs/cluster_setup.html
    hadoop/core/trunk/docs/hdfs_design.html
    hadoop/core/trunk/docs/hdfs_user_guide.html
    hadoop/core/trunk/docs/hod.html
    hadoop/core/trunk/docs/mapred_tutorial.html
    hadoop/core/trunk/docs/native_libraries.html
    hadoop/core/trunk/docs/quickstart.html
    hadoop/core/trunk/docs/streaming.html
    hadoop/core/trunk/src/docs/src/documentation/content/xdocs/tabs.xml

Modified: hadoop/core/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/CHANGES.txt?rev=616662&r1=616661&r2=616662&view=diff
==============================================================================
--- hadoop/core/trunk/CHANGES.txt (original)
+++ hadoop/core/trunk/CHANGES.txt Tue Jan 29 23:05:03 2008
@@ -5,6 +5,16 @@
   INCOMPATIBLE CHANGES
+  NEW FEATURES
+
+  OPTIMIZATIONS
+
+  BUG FIXES
+
+Release 0.16.0 - 2008-02-04
+
+  INCOMPATIBLE CHANGES
+
   HADOOP-1245. Use the mapred.tasktracker.tasks.maximum value configured
   on each tasktracker when allocating tasks, instead of the value
   configured on the jobtracker.
   InterTrackerProtocol

Modified: hadoop/core/trunk/build.xml
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/build.xml?rev=616662&r1=616661&r2=616662&view=diff
==============================================================================
--- hadoop/core/trunk/build.xml (original)
+++ hadoop/core/trunk/build.xml Tue Jan 29 23:05:03 2008
@@ -26,7 +26,7 @@
-
+

Modified: hadoop/core/trunk/docs/cluster_setup.html
URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/cluster_setup.html?rev=616662&r1=616661&r2=616662&view=diff
==============================================================================
--- hadoop/core/trunk/docs/cluster_setup.html (original)
+++ hadoop/core/trunk/docs/cluster_setup.html Tue Jan 29 23:05:03 2008
@@ -210,7 +210,7 @@
-
+

Purpose

This document describes how to install, configure and manage non-trivial @@ -222,7 +222,7 @@

- +

Pre-requisites

    @@ -241,7 +241,7 @@
- +

Installation

Installing a Hadoop cluster typically involves unpacking the software @@ -257,11 +257,11 @@

- +

Configuration

The following sections describe how to configure a Hadoop cluster.

- +

Configuration Files

Hadoop configuration is driven by two important configuration files found in the conf/ directory of the distribution:

@@ -285,14 +285,14 @@

Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the conf/hadoop-env.sh.

- +

Site Configuration

To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute, as well as the configuration parameters for the Hadoop daemons.

The Hadoop daemons are NameNode/DataNode and JobTracker/TaskTracker.

- +

Configuring the Environment of the Hadoop Daemons

Administrators should use the conf/hadoop-env.sh script to do site-specific customization of the Hadoop daemons' process @@ -318,7 +318,7 @@ - +

Configuring the Hadoop Daemons

This section deals with important parameters to be specified in the conf/hadoop-site.xml for the Hadoop cluster.

@@ -442,7 +442,7 @@ final to ensure that they cannot be overridden by user-applications.

- +
Real-World Cluster Configurations

This section lists some non-default configuration parameters which have been used to run the sort benchmark on very large @@ -603,7 +603,7 @@ - +

Slaves

Typically you choose one machine in the cluster to act as the NameNode and one machine to act as the @@ -612,7 +612,7 @@ referred to as slaves.

List all slave hostnames or IP addresses in your conf/slaves file, one per line.

- +

Logging

Hadoop uses the Apache log4j via the Apache @@ -625,7 +625,7 @@

- +

Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and @@ -660,7 +660,7 @@

- +

Hadoop Shutdown

Modified: hadoop/core/trunk/docs/hdfs_design.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/hdfs_design.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/hdfs_design.html (original) +++ hadoop/core/trunk/docs/hdfs_design.html Tue Jan 29 23:05:03 2008 @@ -287,7 +287,7 @@

- +

Introduction

@@ -296,35 +296,35 @@

- +

Assumptions and Goals

- +

Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS instance may consist of hundreds or thousands of server machines, each storing part of the file system’s data. The fact that there are a huge number of components and that each component has a non-trivial probability of failure means that some component of HDFS is always non-functional. Therefore, detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS.

- +

Streaming Data Access

Applications that run on HDFS need streaming access to their data sets. They are not general purpose applications that typically run on general purpose file systems. HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. POSIX imposes many hard requirements that are not needed for applications that are targeted for HDFS. POSIX semantics in a few key areas has been traded to increase data throughput rates.

- +

Large Data Sets

Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. It should support tens of millions of files in a single instance.

- +

Simple Coherency Model

HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. This assumption simplifies data coherency issues and enables high throughput data access. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

- +

“Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data set is huge. This minimizes network congestion and increases the overall throughput of the system. The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running. HDFS provides interfaces for applications to move themselves closer to where the data is located.

- +

Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to another. This facilitates widespread adoption of HDFS as a platform of choice for a large set of applications. @@ -333,7 +333,7 @@ - +

Namenode and Datanodes

@@ -352,7 +352,7 @@ - +

The File System Namespace

@@ -366,7 +366,7 @@ - +

Data Replication

@@ -377,7 +377,7 @@

HDFS Datanodes
- +

Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems. This is a feature that needs lots of tuning and experience. The purpose of a rack-aware replica placement policy is to improve data reliability, availability, and network bandwidth utilization. The current implementation for the replica placement policy is a first effort in this direction. The short-term goals of implementing this policy are to validate it on production systems, learn more about its behavior, and build a foundation to test and research more sophisticated policies. @@ -394,12 +394,12 @@

The current, default replica placement policy described here is a work in progress.

- +

Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If an HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

- +

SafeMode

On startup, the Namenode enters a special state called Safemode. Replication of data blocks does not occur when the Namenode is in the Safemode state. The Namenode receives Heartbeat and Blockreport messages from the Datanodes. A Blockreport contains the list of data blocks that a Datanode is hosting. Each block has a specified minimum number of replicas. A block is considered safely replicated when the minimum number of replicas of that data block has checked in with the Namenode. After a configurable percentage of safely replicated data blocks checks in with the Namenode (plus an additional 30 seconds), the Namenode exits the Safemode state. It then determines the list of data blocks (if any) that still have fewer than the specified number of replicas. The Namenode then replicates these blocks to other Datanodes. @@ -407,7 +407,7 @@

- +

The Persistence of File System Metadata

@@ -423,7 +423,7 @@ - +

The Communication Protocols

@@ -433,29 +433,29 @@ - +

Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are Namenode failures, Datanode failures and network partitions.

- +

Data Disk Failure, Heartbeats and Re-Replication

Each Datanode sends a Heartbeat message to the Namenode periodically. A network partition can cause a subset of Datanodes to lose connectivity with the Namenode. The Namenode detects this condition by the absence of a Heartbeat message. The Namenode marks Datanodes without recent Heartbeats as dead and does not forward any new IO requests to them. Any data that was registered to a dead Datanode is not available to HDFS any more. Datanode death may cause the replication factor of some blocks to fall below their specified value. The Namenode constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The necessity for re-replication may arise due to many reasons: a Datanode may become unavailable, a replica may become corrupted, a hard disk on a Datanode may fail, or the replication factor of a file may be increased.

- +

Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one Datanode to another if the free space on a Datanode falls below a certain threshold. In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster. These types of data rebalancing schemes are not yet implemented.

- +

Data Integrity

It is possible that a block of data fetched from a Datanode arrives corrupted. This corruption can occur because of faults in a storage device, network faults, or buggy software. The HDFS client software implements checksum checking on the contents of HDFS files. When a client creates an HDFS file, it computes a checksum of each block of the file and stores these checksums in a separate hidden file in the same HDFS namespace. When a client retrieves file contents it verifies that the data it received from each Datanode matches the checksum stored in the associated checksum file. If not, then the client can opt to retrieve that block from another Datanode that has a replica of that block.

- +

Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A corruption of these files can cause the HDFS instance to be non-functional. For this reason, the Namenode can be configured to support maintaining multiple copies of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of the FsImages and EditLogs to get updated synchronously. This synchronous updating of multiple copies of the FsImage and EditLog may degrade the rate of namespace transactions per second that a Namenode can support. However, this degradation is acceptable because even though HDFS applications are very data intensive in nature, they are not metadata intensive. When a Namenode restarts, it selects the latest consistent FsImage and EditLog to use. @@ -463,7 +463,7 @@

The Namenode machine is a single point of failure for an HDFS cluster. If the Namenode machine fails, manual intervention is necessary. Currently, automatic restart and failover of the Namenode software to another machine is not supported.

- +

Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS instance to a previously known good point in time. HDFS does not currently support snapshots but will in a future release. @@ -472,15 +472,15 @@ - +

Data Organization

- +

Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different Datanode.

- +

Staging

A client request to create a file does not reach the Namenode immediately. In fact, initially the HDFS client caches the file data into a temporary local file. Application writes are transparently redirected to this temporary local file. When the local file accumulates data worth over one HDFS block size, the client contacts the Namenode. The Namenode inserts the file name into the file system hierarchy and allocates a data block for it. The Namenode responds to the client request with the identity of the Datanode and the destination data block. Then the client flushes the block of data from the local temporary file to the specified Datanode. When a file is closed, the remaining un-flushed data in the temporary local file is transferred to the Datanode. The client then tells the Namenode that the file is closed. At this point, the Namenode commits the file creation operation into a persistent store. If the Namenode dies before the file is closed, the file is lost. @@ -488,7 +488,7 @@

The above approach has been adopted after careful consideration of target applications that run on HDFS. These applications need streaming writes to files. If a client writes to a remote file directly without any client side buffering, the network speed and the congestion in the network impacts throughput considerably. This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have used client side caching to improve performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

- +

Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a local file as explained in the previous section. Suppose the HDFS file has a replication factor of three. When the local file accumulates a full block of user data, the client retrieves a list of Datanodes from the Namenode. This list contains the Datanodes that will host a replica of that block. The client then flushes the data block to the first Datanode. The first Datanode starts receiving the data in small portions (4 KB), writes each portion to its local repository and transfers that portion to the second Datanode in the list. The second Datanode, in turn starts receiving each portion of the data block, writes that portion to its repository and then flushes that portion to the third Datanode. Finally, the third Datanode writes the data to its local repository. Thus, a Datanode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline. Thus, the data is pipelined from one Datanode to the next. @@ -496,13 +496,13 @@

- +

Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.

- +

DFSShell

HDFS allows user data to be organized in the form of files and directories. It provides a commandline interface called DFSShell that lets a user interact with the data in HDFS. The syntax of this command set is similar to other shells (e.g. bash, csh) that users are already familiar with. Here are some sample action/command pairs: @@ -537,7 +537,7 @@

DFSShell is targeted for applications that need a scripting language to interact with the stored data.

- +

DFSAdmin

The DFSAdmin command set is used for administering an HDFS cluster. These are commands that are used only by an HDFS administrator. Here are some sample action/command pairs: @@ -569,7 +569,7 @@ - +

Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of its files using a web browser. @@ -577,10 +577,10 @@

- +

Space Reclamation

- +

File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the Namenode deletes the file from the HDFS namespace. The deletion of a file causes the blocks associated with the file to be freed. Note that there could be an appreciable time delay between the time a file is deleted by a user and the time of the corresponding increase in free space in HDFS. @@ -588,7 +588,7 @@

A user can Undelete a file after deleting it as long as it remains in the /trash directory. If a user wants to undelete a file that he/she has deleted, he/she can navigate the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well defined interface.

- +

Decrease Replication Factor

When the replication factor of a file is reduced, the Namenode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the Datanode. The Datanode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay between the completion of the setReplication API call and the appearance of free space in the cluster. @@ -597,7 +597,7 @@ - +

References

Modified: hadoop/core/trunk/docs/hdfs_user_guide.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/hdfs_user_guide.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/hdfs_user_guide.html (original) +++ hadoop/core/trunk/docs/hdfs_user_guide.html Tue Jan 29 23:05:03 2008 @@ -220,7 +220,7 @@

- +

Purpose

@@ -235,7 +235,7 @@

- +

Overview

@@ -341,7 +341,7 @@

- +

Pre-requisites

@@ -370,7 +370,7 @@ machine.

- +

Web Interface

@@ -384,7 +384,7 @@ page).

- +

Shell Commands

@@ -400,7 +400,7 @@ changing file permissions, etc. It also supports a few HDFS specific operations like changing replication of files.

- +

DFSAdmin Command

@@ -433,7 +433,7 @@

- +

Secondary Namenode

@@ -458,7 +458,7 @@ specified in conf/masters file.

- +

Rebalancer

@@ -503,7 +503,7 @@ HADOOP-1652.

- +

Rack Awareness

@@ -522,7 +522,7 @@ HADOOP-692.

- +

Safemode

@@ -542,7 +542,7 @@ setSafeMode().

- +

Fsck

@@ -558,7 +558,7 @@ Fsck can be run on the whole filesystem or on a subset of files.

- +

Upgrade and Rollback

@@ -617,7 +617,7 @@

- +

File Permissions and Security

@@ -629,7 +629,7 @@ authentication and encryption of data transfers.

- +

Scalability

@@ -647,7 +647,7 @@ suggested configuration improvements for large Hadoop clusters.

- +

Related Documentation

Modified: hadoop/core/trunk/docs/hod.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/hod.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/hod.html (original) +++ hadoop/core/trunk/docs/hod.html Tue Jan 29 23:05:03 2008 @@ -294,7 +294,7 @@

- +

Introduction

@@ -303,30 +303,30 @@

- +

Feature List

- +

Simplified Interface for Provisioning Hadoop Clusters

By far the biggest advantage of HOD is the ability to quickly set up a Hadoop cluster. The user interacts with the cluster through a simple command line interface, the HOD client. HOD brings up a virtual MapReduce cluster with the required number of nodes, which the user can use for running Hadoop jobs. When done, HOD will automatically clean up the resources and make the nodes available again.

- +

Automatic installation of Hadoop

With HOD, Hadoop does not need to be even installed on the cluster. The user can provide a Hadoop tarball that HOD will automatically distribute to all the nodes in the cluster.

- +

Configuring Hadoop

Dynamic parameters of Hadoop configuration, such as the NameNode and JobTracker addresses and ports, and file system temporary directories are generated and distributed by HOD automatically to all nodes in the cluster. In addition, HOD allows the user to configure Hadoop parameters at both the server (e.g. JobTracker) and client (e.g. JobClient) level, including the 'final' parameters that were introduced with Hadoop 0.15.

- +

Auto-cleanup of Unused Clusters

HOD has an automatic timeout so that users cannot misuse resources they aren't using. The timeout applies only when there is no MapReduce job running.

- +

Log Services

HOD can be used to collect all MapReduce logs to a central location for archiving and inspection after the job is completed. @@ -334,13 +334,13 @@

- +

HOD Components

This is a brief overview of the various components of HOD and how they interact to provision Hadoop.

- +

HOD Client

The HOD client is a Unix command that users use to allocate Hadoop MapReduce clusters. The command provides other options to list allocated clusters and deallocate them. The HOD client generates the hadoop-site.xml in a user specified directory. The user can point to this configuration file while running Map/Reduce jobs on the allocated cluster. @@ -348,7 +348,7 @@

The nodes from where the HOD Client is run are called submit nodes because jobs are submitted to the resource manager system for allocating and running clusters from these nodes.

- +

RingMaster

The RingMaster is a HOD process that is started on one node per every allocated cluster. It is submitted as a 'job' to the resource manager by the HOD client. It controls which Hadoop daemons start on which nodes. It provides this information to other HOD processes, such as the HOD client, so users can also determine this information. The RingMaster is responsible for hosting and distributing the Hadoop tarball to all nodes in the cluster. It also automatically cleans up unused clusters. @@ -356,17 +356,17 @@

- +

HodRing

The HodRing is a HOD process that runs on every allocated node in the cluster. These processes are run by the RingMaster through the resource manager, using a facility of parallel execution. The HodRings are responsible for launching Hadoop commands on the nodes to bring up the Hadoop daemons. They get the command to launch from the RingMaster.

- +

Hodrc / HOD configuration file

An INI style configuration file where the users configure various options for the HOD system, including install locations of different software, resource manager parameters, log and temp file directories, parameters for their MapReduce jobs, etc.

- +

Submit Nodes and Compute Nodes

The nodes from where the HOD Client is run are referred to as submit nodes because jobs are submitted to the resource manager system for allocating and running clusters from these nodes. @@ -377,17 +377,17 @@

- +

Getting Started with HOD

- +

Pre-Requisites

- +

Hardware

HOD requires a minimum of 3 nodes configured through a resource manager.

- +

Software

The following components are assumed to be installed before using HOD: @@ -424,7 +424,7 @@

HOD configuration requires the install locations of these components to be the same on all nodes in the cluster. Configuration is also simpler if the same locations are used on the submit nodes.

- +

Resource Manager Configuration Pre-requisites

For using HOD with Torque: @@ -456,7 +456,7 @@ More information about setting up Torque can be found by referring to the documentation here.

- +

Setting up HOD

    @@ -550,15 +550,15 @@
- +

Running HOD

- +

Overview

A typical session of HOD will involve at least three steps: allocate, run Hadoop jobs, deallocate.

- +

Operation allocate

The allocate operation is used to allocate a set of nodes and install and provision Hadoop on them. It has the following syntax: @@ -605,7 +605,7 @@ - +

Running Hadoop jobs using the allocated cluster

Now, one can run Hadoop jobs using the allocated cluster in the usual manner: @@ -631,7 +631,7 @@ - +

Operation deallocate

The deallocate operation is used to release an allocated cluster. When finished with a cluster, deallocate must be run so that the nodes become free for others to use. The deallocate operation has the following syntax: @@ -657,7 +657,7 @@ - +

Command Line Options

This section covers the major command line options available via the hod command: @@ -768,10 +768,10 @@

- +

HOD Configuration

- +

Introduction to HOD Configuration

Configuration options for HOD are organized as sections and options within them. They can be specified in two ways: a configuration file in the INI format, and as command line options to the HOD shell, specified in the format --section.option[=value]. If the same option is specified in both places, the value specified on the command line overrides the value in the configuration file. @@ -783,7 +783,7 @@

This section explains some of the most important or commonly used configuration options in some more detail.

- +

Categories / Sections in HOD Configuration

The following are the various sections in the HOD configuration: @@ -840,9 +840,9 @@ - +

Important and Commonly Used Configuration Options

- +

Common configuration options

Certain configuration options are defined in most of the sections of the HOD configuration. Options defined in a section, are used by the process for which that section applies. These options have the same meaning, but can have different values in each section. @@ -892,7 +892,7 @@ - +

hod options

@@ -918,7 +918,7 @@
- +

resource_manager options

@@ -951,7 +951,7 @@
- +

ringmaster options

@@ -970,7 +970,7 @@
- +

gridservice-hdfs options

@@ -1037,7 +1037,7 @@
- +

gridservice-mapred options

Modified: hadoop/core/trunk/docs/mapred_tutorial.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/mapred_tutorial.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/mapred_tutorial.html (original) +++ hadoop/core/trunk/docs/mapred_tutorial.html Tue Jan 29 23:05:03 2008 @@ -280,7 +280,7 @@ Example: WordCount v2.0
  • -Source Code +Source Code
  • Sample Runs @@ -294,7 +294,7 @@ - +

    Purpose

    This document comprehensively describes all user-facing facets of the @@ -303,7 +303,7 @@

    - +

    Pre-requisites

    Ensure that Hadoop is installed, configured and is running. More @@ -323,7 +323,7 @@

    - +

    Overview

    Hadoop Map-Reduce is a software framework for easily writing @@ -381,7 +381,7 @@

    - +

    Inputs and Outputs

    The Map-Reduce framework operates exclusively on @@ -415,7 +415,7 @@

    - +

    Example: WordCount v1.0

Before we jump into the details, let's walk through an example Map-Reduce @@ -428,7 +428,7 @@ pseudo-distributed or fully-distributed Hadoop installation.

    - +

    Source Code

@@ -991,7 +991,7 @@
- +

Usage

Assuming HADOOP_HOME is the root of the installation and HADOOP_VERSION is the Hadoop version installed, compile @@ -1086,7 +1086,7 @@

- +

Walk-through

The WordCount application is quite straightforward.

The Mapper implementation (lines 14-26), via the @@ -1196,7 +1196,7 @@

- +

Map-Reduce - User Interfaces

This section provides a reasonable amount of detail on every user-facing @@ -1215,12 +1215,12 @@

Finally, we will wrap up by discussing some useful features of the framework such as the DistributedCache, IsolationRunner etc.

- +

Payload

Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods. These form the core of the job.
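As a rough sketch of such a payload (the class name and token-counting logic are invented for illustration, and the generic forms of the org.apache.hadoop.mapred interfaces are assumed; it mirrors the WordCount mapper this tutorial walks through):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class TokenCountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record; emits a (token, 1) pair for each token.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        output.collect(word, ONE);
      }
    }
  }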

- +

Mapper

@@ -1276,7 +1276,7 @@ CompressionCodec to be used via the JobConf.

- +
How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

@@ -1289,7 +1289,7 @@ setNumMapTasks(int) (which only provides a hint to the framework) is used to set it even higher.

- +

Reducer

@@ -1312,18 +1312,18 @@

Reducer has 3 primary phases: shuffle, sort and reduce.

- +
Shuffle

Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.

- +
Sort

The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.

The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

- +
Secondary Sort

If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before @@ -1334,7 +1334,7 @@ JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate secondary sort on values.
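A hedged sketch of that wiring; the two comparator classes named here are placeholders for application-provided RawComparator implementations, not classes that ship with Hadoop:

  import org.apache.hadoop.mapred.JobConf;

  class SecondarySortWiring {
    static void configure(JobConf conf) {
      // Full composite-key ordering used when sorting intermediate map outputs.
      conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);          // hypothetical class
      // Coarser comparator deciding which keys are grouped into one reduce() call.
      conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class); // hypothetical class
    }
  }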

- +
Reduce

In this phase the @@ -1350,7 +1350,7 @@ progress, set application-level status messages and update Counters, or just indicate that they are alive.

The output of the Reducer is not sorted.
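A matching reduce-side sketch, again with an invented class name and assuming the generic interface forms:

  import java.io.IOException;
  import java.util.Iterator;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;

  public class SumReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Called once per key with an iterator over all values grouped under that key.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }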

-
+
How Many Reduces?

The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * @@ -1365,7 +1365,7 @@

The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative-tasks and failed tasks.

- +
Reducer NONE

It is legal to set the number of reduce-tasks to zero if no reduction is desired.

@@ -1375,7 +1375,7 @@ setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
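A small sketch of that map-only setup, assuming a JobConf built elsewhere; the output path is illustrative:

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.JobConf;

  class MapOnlyJobSketch {
    static void configure(JobConf conf) {
      conf.setNumReduceTasks(0);  // zero reduces: map outputs go straight to the output path
      conf.setOutputPath(new Path("/user/hypothetical/map-only-output"));  // illustrative path
    }
  }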

- +

Partitioner

@@ -1389,7 +1389,7 @@

HashPartitioner is the default Partitioner.
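For illustration only, a custom Partitioner that routes records by a key prefix could look like the sketch below (class name and prefix rule are invented; the generic interface form is assumed):

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class KeyPrefixPartitioner implements Partitioner<Text, IntWritable> {
    // Partition on the text before the first '.' so related keys reach the same reduce.
    public int getPartition(Text key, IntWritable value, int numPartitions) {
      String prefix = key.toString().split("\\.", 2)[0];
      return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public void configure(JobConf job) {
      // Nothing to configure; present because Partitioner extends JobConfigurable.
    }
  }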

- +

Reporter

@@ -1408,7 +1408,7 @@

Applications can also update Counters using the Reporter.

-
+

OutputCollector

@@ -1419,7 +1419,7 @@

Hadoop Map-Reduce comes bundled with a library of generally useful mappers, reducers, and partitioners.

- +

Job Configuration

@@ -1474,7 +1474,7 @@ set(String, String)/get(String, String) to set/get arbitrary parameters needed by applications. However, use the DistributedCache for large amounts of (read-only) data.
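A brief sketch of that pattern; the property name and values are made up for illustration:

  import org.apache.hadoop.mapred.JobConf;

  class JobParameterSketch {
    static void setAtSubmitTime(JobConf conf) {
      // Small, job-specific settings ride along in the job configuration...
      conf.set("myapp.dictionary.size", "50000");
    }

    static int readInTask(JobConf conf) {
      // ...and tasks read them back, here with a default of 10000.
      return conf.getInt("myapp.dictionary.size", 10000);
    }
  }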

- +

Task Execution & Environment

The TaskTracker executes the Mapper/Reducer task as a child process in a separate jvm. @@ -1534,7 +1534,7 @@ loaded via System.loadLibrary or System.load.

- +

Job Submission and Monitoring

@@ -1570,7 +1570,7 @@

Normally the user creates the application, describes various facets of the job via JobConf, and then uses the JobClient to submit the job and monitor its progress.
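Put together, a driver along those lines might look like the following sketch; it reuses the hypothetical TokenCountMapper/SumReducer classes from the earlier sketches and takes its paths from the command line:

  import java.io.IOException;

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;

  public class TokenCountDriver {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(TokenCountDriver.class);
      conf.setJobName("token-count-sketch");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(TokenCountMapper.class);   // hypothetical classes from the
      conf.setCombinerClass(SumReducer.class);       // earlier sketches
      conf.setReducerClass(SumReducer.class);
      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(new Path(args[1]));
      JobClient.runJob(conf);  // submits the job and polls for progress until it completes
    }
  }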

- +

Job Control

Users may need to chain map-reduce jobs to accomplish complex tasks which cannot be done via a single map-reduce job. This is fairly @@ -1606,7 +1606,7 @@ - +

Job Input

@@ -1654,7 +1654,7 @@ appropriate CompressionCodec. However, it must be noted that compressed files with the above extensions cannot be split and each compressed file is processed in its entirety by a single mapper.

- +

InputSplit

@@ -1668,7 +1668,7 @@ FileSplit is the default InputSplit. It sets map.input.file to the path of the input file for the logical split.

- +

RecordReader

@@ -1680,7 +1680,7 @@ for processing. RecordReader thus assumes the responsibility of processing record boundaries and presents the tasks with keys and values.

- +

Job Output

@@ -1705,7 +1705,7 @@

TextOutputFormat is the default OutputFormat.

- +

Task Side-Effect Files

In some applications, component tasks need to create and/or write to side-files, which differ from the actual job-output files.

@@ -1731,7 +1731,7 @@ JobConf.getOutputPath(), and the framework will promote them similarly for succesful task-attempts, thus eliminating the need to pick unique paths per task-attempt.

- +

RecordWriter

@@ -1739,9 +1739,9 @@ pairs to an output file.

RecordWriter implementations write the job outputs to the FileSystem.

-
+

Other Useful Features

- +

Counters

Counters represent global counters, defined either by @@ -1755,7 +1755,7 @@ Reporter.incrCounter(Enum, long) in the map and/or reduce methods. These counters are then globally aggregated by the framework.
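A hedged sketch of that usage from the map side; the enum and its members are invented for the example:

  import java.io.IOException;

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class CountingMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {
    // Application-defined counter group; enum and member names are illustrative.
    public enum Records { PROCESSED, MALFORMED }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, LongWritable> output, Reporter reporter)
        throws IOException {
      if (value.toString().trim().length() == 0) {
        reporter.incrCounter(Records.MALFORMED, 1);  // aggregated globally by the framework
        return;
      }
      reporter.incrCounter(Records.PROCESSED, 1);
      output.collect(value, new LongWritable(1));
    }
  }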

- +

DistributedCache

@@ -1788,7 +1788,7 @@ DistributedCache.createSymlink(Path, Configuration) api. Files have execution permissions set.
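A rough sketch of shipping a read-only side file through the DistributedCache at job-submission time; the HDFS path is illustrative and the file is assumed to already exist in HDFS:

  import java.net.URI;
  import java.net.URISyntaxException;

  import org.apache.hadoop.filecache.DistributedCache;
  import org.apache.hadoop.mapred.JobConf;

  class CacheFileSketch {
    static void configure(JobConf conf) throws URISyntaxException {
      // Copied once per job to every task node; tasks then read it as a local file.
      DistributedCache.addCacheFile(new URI("/user/hypothetical/lookup.dat"), conf);
    }
  }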

- +

Tool

The Tool interface supports the handling of generic Hadoop command-line options. @@ -1828,7 +1828,7 @@
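A skeletal Tool-based driver, assuming the ToolRunner helper in org.apache.hadoop.util is available in this release; the class name is invented and the job setup itself is elided:

  import org.apache.hadoop.conf.Configured;
  import org.apache.hadoop.util.Tool;
  import org.apache.hadoop.util.ToolRunner;

  public class MyHadoopTool extends Configured implements Tool {
    public int run(String[] args) throws Exception {
      // Build a JobConf from getConf() and submit it here; generic Hadoop
      // command-line options have already been folded into the configuration.
      return 0;
    }

    public static void main(String[] args) throws Exception {
      System.exit(ToolRunner.run(new MyHadoopTool(), args));
    }
  }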

- +

IsolationRunner

@@ -1852,13 +1852,13 @@

IsolationRunner will run the failed task in a single jvm, which can be in the debugger, over precisely the same input.

- +

JobControl

JobControl is a utility which encapsulates a set of Map-Reduce jobs and their dependencies.

- +

Data Compression

Hadoop Map-Reduce provides facilities for the application-writer to specify compression for both intermediate map-outputs and the @@ -1872,7 +1872,7 @@ codecs for reasons of both performance (zlib) and non-availability of Java libraries (lzo). More details on their usage and availability are available here.

- +
Intermediate Outputs

Applications can control compression of intermediate map-outputs via the @@ -1893,7 +1893,7 @@ JobConf.setMapOutputCompressionType(SequenceFile.CompressionType) api.
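Sketched with the JobConf calls named above; the choice of compression type is illustrative:

  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.mapred.JobConf;

  class MapOutputCompressionSketch {
    static void configure(JobConf conf) {
      conf.setCompressMapOutput(true);  // compress the intermediate map outputs
      // BLOCK compresses runs of records together, usually a better ratio than RECORD.
      conf.setMapOutputCompressionType(SequenceFile.CompressionType.BLOCK);
    }
  }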

- +
Job Outputs

Applications can control compression of job-outputs via the @@ -1913,7 +1913,7 @@
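As a hedged sketch using the equivalent configuration properties (the same ones the streaming section later passes with -jobconf):

  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapred.JobConf;

  class JobOutputCompressionSketch {
    static void configure(JobConf conf) {
      conf.setBoolean("mapred.output.compress", true);
      conf.setClass("mapred.output.compression.codec",
                    GzipCodec.class, CompressionCodec.class);
    }
  }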

- +

Example: WordCount v2.0

Here is a more complete WordCount which uses many of the @@ -1923,7 +1923,7 @@ pseudo-distributed or fully-distributed Hadoop installation.

- +

Source Code

@@ -3133,7 +3133,7 @@
- +

Sample Runs

Sample text-files as input:

@@ -3301,7 +3301,7 @@

- +

Highlights

The second version of WordCount improves upon the previous one by using some features offered by the Map-Reduce framework: Modified: hadoop/core/trunk/docs/native_libraries.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/native_libraries.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/native_libraries.html (original) +++ hadoop/core/trunk/docs/native_libraries.html Tue Jan 29 23:05:03 2008 @@ -190,7 +190,7 @@

- +

Purpose

Hadoop has native implementations of certain components for reasons of @@ -201,7 +201,7 @@

- +

Components

Hadoop currently has the following @@ -227,7 +227,7 @@

- +

Usage

It is fairly simple to use the native hadoop libraries:

@@ -281,7 +281,7 @@
- +

Supported Platforms

The Hadoop native library is supported on *nix platforms only. @@ -311,7 +311,7 @@

- +

Building Native Hadoop Libraries

Hadoop native library is written in @@ -360,7 +360,7 @@

where <platform> is combination of the system-properties: ${os.name}-${os.arch}-${sun.arch.data.model}; for e.g. Linux-i386-32.

- +

Notes

    Modified: hadoop/core/trunk/docs/quickstart.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/quickstart.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/quickstart.html (original) +++ hadoop/core/trunk/docs/quickstart.html Tue Jan 29 23:05:03 2008 @@ -215,7 +215,7 @@
- +

Purpose

The purpose of this document is to help users get a single-node Hadoop @@ -227,10 +227,10 @@

- +

Pre-requisites

- +

Supported Platforms

    @@ -245,7 +245,7 @@
- +

Required Software

    @@ -262,7 +262,7 @@
- +

Additional requirements for Windows

    @@ -273,7 +273,7 @@
- +

Installing Software

If your cluster doesn't have the requisite software you will need to install it.

@@ -296,7 +296,7 @@
- +

Download

@@ -318,7 +318,7 @@

- +

Standalone Operation

By default, Hadoop is configured to run things in a non-distributed @@ -346,12 +346,12 @@

- +

Pseudo-Distributed Operation

Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process.

- +

Configuration

Use the following conf/hadoop-site.xml:

@@ -417,7 +417,7 @@
- +

Setup passphraseless ssh

Now check that you can ssh to the localhost without a passphrase:
@@ -435,7 +435,7 @@ $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

- +

Execution

Format a new distributed-filesystem:
@@ -512,7 +512,7 @@

- +

Fully-Distributed Operation

Information on setting up fully-distributed non-trivial clusters Modified: hadoop/core/trunk/docs/streaming.html URL: http://svn.apache.org/viewvc/hadoop/core/trunk/docs/streaming.html?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/docs/streaming.html (original) +++ hadoop/core/trunk/docs/streaming.html Tue Jan 29 23:05:03 2008 @@ -253,7 +253,7 @@

- +

Hadoop Streaming

@@ -269,7 +269,7 @@

- +

How Does Streaming Work

@@ -298,7 +298,7 @@

- +

Package Files With Job Submissions

@@ -330,10 +330,10 @@

- +

Streaming Options and Usage

- +

Mapper-Only Jobs

Often, you may want to process input data using a map function only. To do this, simply set mapred.reduce.tasks to zero. The map/reduce framework will not create any reducer tasks. Rather, the outputs of the mapper tasks will be the final output of the job. @@ -341,7 +341,7 @@

To be backward compatible, Hadoop Streaming also supports the "-reduce NONE" option, which is equivalent to "-jobconf mapred.reduce.tasks=0".

- +

Specifying Other Plugins for Jobs

Just as with a normal map/reduce job, you can specify other plugins for a streaming job: @@ -358,7 +358,7 @@

The class you supply for the output format is expected to take key/value pairs of Text class. If you do not specify an output format class, the TextOutputFormat is used as the default.

- +

Large files and archives in Hadoop Streaming

The -cacheFile and -cacheArchive options allow you to make files and archives available to the tasks. The argument is a URI to the file or archive that you have already uploaded to HDFS. These files and archives are cached across jobs. You can retrieve the host and fs_port values from the fs.default.name config variable. @@ -427,7 +427,7 @@ This is just the second cache string - +

Specifying Additional Configuration Variables for Jobs

You can specify additional configuration variables by using "-jobconf <n>=<v>". For example: @@ -446,7 +446,7 @@

For more details on the jobconf parameters see: http://wiki.apache.org/hadoop/JobConfFile

- +

Other Supported Options

Other options you may specify for a streaming job are described here: @@ -528,10 +528,10 @@

- +

More usage examples

- +

Customizing the Way to Split Lines into Key/Value Pairs

As noted earlier, when the map/reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. By default, the prefix of the line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. @@ -554,7 +554,7 @@

Similarly, you can use "-jobconf stream.reduce.output.field.separator=SEP" and "-jobconf stream.num.reduce.output.fields=NUM" to specify the nth field separator in a line of the reduce outputs as the separator between the key and the value.

- +

A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option)

Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the map/reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example: @@ -614,7 +614,7 @@ 11.14.2.2 11.14.2.3 - +

Working with the Hadoop Aggregate Package (the -reduce aggregate option)

Hadoop has a library package called "Aggregate" (https://svn.apache.org/repos/asf/hadoop/core/trunk/src/java/org/apache/hadoop/mapred/lib/aggregate). Aggregate provides a special reducer class and a special combiner class, and a list of simple aggregators that perform aggregations such as "sum", "max", "min" and so on over a sequence of values. Aggregate allows you to define a mapper plugin class that is expected to generate "aggregatable items" for each input key/value pair of the mappers. The combiner/reducer will aggregate those aggregatable items by invoking the appropriate aggregators. @@ -655,7 +655,7 @@ if __name__ == "__main__": main(sys.argv) - +

Field Selection ( similar to unix 'cut' command)

Hadoop has a library class, org.apache.hadoop.mapred.lib.FieldSelectionMapReduce, that effectively allows you to process text data like the unix "cut" utility. The map function defined in the class treats each input key/value pair as a list of fields. You can specify the field separator (the default is the tab character). You can select an arbitrary list of fields as the map output key, and an arbitrary list of fields as the map output value. Similarly, the reduce function defined in the class treats each input key/value pair as a list of fields. You can select an arbitrary list of fields as the reduce output key, and an arbitrary list of fields as the reduce output value. For example: @@ -684,15 +684,15 @@

- +

Frequently Asked Questions

- +

How do I use Hadoop Streaming to run an arbitrary set of (semi-)independent tasks?

Often you do not need the full power of Map Reduce, but only need to run multiple instances of the same program - either on different parts of the data, or on the same data, but with different parameters. You can use Hadoop Streaming to do this.

- +

How do I process files, one per map?

As an example, consider the problem of zipping (compressing) a set of files across the hadoop cluster. You can achieve this using either of these methods: @@ -736,13 +736,13 @@ - +

How many reducers should I use?

See the Hadoop Wiki for details: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

- +

If I set up an alias in my shell script, will that work after -mapper, i.e. say I do: alias c1='cut -f1'. Will -mapper "c1" work?

Using an alias will not work, but variable substitution is allowed as shown in this example: @@ -769,12 +769,12 @@ 75 80 - +

Can I use UNIX pipes? For example, will -mapper "cut -f1 | sed s/foo/bar/g" work?

Currently this does not work and gives a "java.io.IOException: Broken pipe" error. This is probably a bug that needs to be investigated.

- +

When I run a streaming job by distributing large executables (for example, 3.6G) through the -file option, I get a "No space left on device" error. What do I do?

The jar packaging happens in a directory pointed to by the configuration variable stream.tmpdir. The default value of stream.tmpdir is /tmp. Set the value to a directory with more space: @@ -782,7 +782,7 @@

 -jobconf stream.tmpdir=/export/bigspace/...
 
- +

How do I specify multiple input directories?

You can specify multiple input directories with multiple '-input' options: @@ -790,17 +790,17 @@

  hadoop jar hadoop-streaming.jar -input '/user/foo/dir1' -input '/user/foo/dir2' 
 
- +

How do I generate output files with gzip format?

Instead of plain text files, you can generate gzip files as your generated output. Pass '-jobconf mapred.output.compress=true -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec' as options to your streaming job.

- +

How do I provide my own input/output format with streaming?

At least as late as version 0.14, Hadoop does not support multiple jar files. So, when specifying your own custom classes you will have to pack them along with the streaming jar and use the custom jar instead of the default hadoop streaming jar.

- +

How do I parse XML documents using streaming?

You can use the record reader StreamXmlRecordReader to process XML documents. Modified: hadoop/core/trunk/src/docs/src/documentation/content/xdocs/tabs.xml URL: http://svn.apache.org/viewvc/hadoop/core/trunk/src/docs/src/documentation/content/xdocs/tabs.xml?rev=616662&r1=616661&r2=616662&view=diff ============================================================================== --- hadoop/core/trunk/src/docs/src/documentation/content/xdocs/tabs.xml (original) +++ hadoop/core/trunk/src/docs/src/documentation/content/xdocs/tabs.xml Tue Jan 29 23:05:03 2008 @@ -18,8 +18,8 @@ -