From: Apache Wiki
To: core-commits@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Date: Wed, 13 Feb 2008 17:20:54 -0000
Message-ID: <20080213172054.26932.84978@eos.apache.org>
Subject: [Hadoop Wiki] Update of "FAQ" by DougCutting

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by DougCutting:
http://wiki.apache.org/hadoop/FAQ

The comment on the change is:
update for TLP move

------------------------------------------------------------------------------
  [[Anchor(1)]]
  '''1. [#1 What is Hadoop?]'''

- [http://lucene.apache.org/hadoop/ Hadoop] is a distributed computing platform written in Java. It incorporates features similar to those of the [http://en.wikipedia.org/wiki/Google_File_System Google File System] and of [http://en.wikipedia.org/wiki/MapReduce MapReduce]. For some details, see HadoopMapReduce.
+ [http://hadoop.apache.org/core/ Hadoop] is a distributed computing platform written in Java. It incorporates features similar to those of the [http://en.wikipedia.org/wiki/Google_File_System Google File System] and of [http://en.wikipedia.org/wiki/MapReduce MapReduce]. For some details, see HadoopMapReduce.

  [[BR]]
  [[Anchor(2)]]
@@ -67, +67 @@
  No. There are several ways to incorporate non-Java code.
   * HadoopStreaming permits any shell command to be used as a map or reduce function.
-  * [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/c%2B%2B/libhdfs libhdfs], a JNI-based C API for talking to hdfs (only).
+  * [http://svn.apache.org/viewvc/hadoop/core/trunk/src/c%2B%2B/libhdfs libhdfs], a JNI-based C API for talking to hdfs (only).
-  * [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/pipes/package-summary.html Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible C++ API (non-JNI) to write map-reduce jobs.
+  * [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html Hadoop Pipes], a [http://www.swig.org/ SWIG]-compatible C++ API (non-JNI) to write map-reduce jobs.

  [[BR]]
  [[Anchor(5)]]
  '''5. [#5 How can I help to make Hadoop better?]'''

- If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the [http://lucene.apache.org/hadoop/mailing_lists.html mailing lists]), pass that knowledge on to others by adding something to this wiki.
+ If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the [http://hadoop.apache.org/core/mailing_lists.html mailing lists]), pass that knowledge on to others by adding something to this wiki.

  If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.
@@ -124, +124 @@
  [[Anchor(9)]]
  '''9. [#9 MR. Can I create/write-to hdfs files directly from my map/reduce tasks?]'''

- Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/OutputCollector.html OutputCollector].)
+ Yes. (Clearly, you want this since you need to create/write-to files other than the output-file written out by [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html OutputCollector].)

  Caveats:

- ${mapred.output.dir} is the eventual output directory for the job ([http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath] / [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath() JobConf.getOutputPath]).
+ ${mapred.output.dir} is the eventual output directory for the job ([http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath] / [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#getOutputPath() JobConf.getOutputPath]).

  ${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
@@ -142, +142 @@
  The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.

- Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_{$taskid}, not the value set by [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''
+ Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_{$taskid}, not the value set by [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputPath(org.apache.hadoop.fs.Path) JobConf.setOutputPath]. ''So, just create any hdfs files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.''

  The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since output of the map, in that case, goes directly to hdfs.
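  As an illustration of the pattern above, here is a minimal sketch written against the 0.16-era org.apache.hadoop.mapred API. The class and method names (SideFiles, createSideFile) are invented for the example, and exact signatures may vary between releases.

{{{
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class SideFiles {

  /**
   * Create an extra output file next to the job's regular output. Call this
   * from inside a running map or reduce task: there, mapred.output.dir
   * already resolves to the task-attempt's private directory
   * (${mapred.output.dir}/_${taskid}), so a fixed name does not collide with
   * other attempts, and the framework moves the file to the real output
   * directory only if the attempt succeeds.
   */
  public static FSDataOutputStream createSideFile(JobConf job, String name)
      throws IOException {
    Path outputDir = new Path(job.get("mapred.output.dir"));
    FileSystem fs = outputDir.getFileSystem(job);
    return fs.create(new Path(outputDir, name));
  }
}
}}}

  A map or reduce task could call SideFiles.createSideFile(job, "extra-stats.txt") (file name invented here) once it has the JobConf in hand, write to the returned stream, and close it before the task finishes.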
@@ -151, +151 @@
  [[Anchor(10)]]
  '''10. [#10 MR. How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files?]'''

- Essentially a job's input is represented by the [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/InputFormat.html InputFormat] (interface) / [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html FileInputFormat] (base class).
+ Essentially a job's input is represented by the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/InputFormat.html InputFormat] (interface) / [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html FileInputFormat] (base class).

- For this purpose one would need a 'non-splittable' [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html FileInputFormat], i.e. an input-format which tells the map-reduce framework that its input cannot be split up and processed in pieces. To do this you need your particular input-format to return '''false''' for the [http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path) isSplitable] call.
+ For this purpose one would need a 'non-splittable' [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html FileInputFormat], i.e. an input-format which tells the map-reduce framework that its input cannot be split up and processed in pieces. To do this you need your particular input-format to return '''false''' for the [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileInputFormat.html#isSplitable(org.apache.hadoop.fs.FileSystem,%20org.apache.hadoop.fs.Path) isSplitable] call.

- E.g. '''org.apache.hadoop.mapred.Sort``Validator.Record``Stats``Checker.Non``Splitable``Sequence``File``Input``Format''' in [http://svn.apache.org/viewvc/lucene/hadoop/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java src/test/org/apache/hadoop/mapred/SortValidator.java]
+ E.g. '''org.apache.hadoop.mapred.Sort``Validator.Record``Stats``Checker.Non``Splitable``Sequence``File``Input``Format''' in [http://svn.apache.org/viewvc/hadoop/core/trunk/src/test/org/apache/hadoop/mapred/SortValidator.java src/test/org/apache/hadoop/mapred/SortValidator.java]

  In addition to implementing the InputFormat interface and having isSplitable(...) return false, it is also necessary to implement the RecordReader interface so that it returns the whole content of the input file. (The default is LineRecordReader, which splits the file into separate lines.)

- The other, quick-fix option is to set [http://lucene.apache.org/hadoop/hadoop-default.html#mapred.min.split.size mapred.min.split.size] to a large enough value.
+ The other, quick-fix option is to set [http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.min.split.size mapred.min.split.size] to a large enough value.

  [[BR]]
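  For illustration, here is a minimal sketch of such a 'non-splittable' input format against the 0.16-era org.apache.hadoop.mapred API. The class name NonSplittableTextInputFormat is invented for the example, and, as noted above, a complete solution would also supply a RecordReader that returns the whole file as a single record.

{{{
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

/**
 * A text input format whose files are never split: each map task is handed
 * one complete input file.
 */
public class NonSplittableTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;   // tell the framework this file may not be split
  }
}
}}}

  A job would then select it with JobConf.setInputFormat(NonSplittableTextInputFormat.class).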
@@ -179, +179 @@
  Depending on how safe mode parameters are configured, the name-node will stay in safe mode until a specific percentage of blocks of the system is ''minimally'' replicated
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.replication.min dfs.replication.min].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.replication.min dfs.replication.min].
  If the safe mode threshold
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.safemode.threshold.pct dfs.safemode.threshold.pct]
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.safemode.threshold.pct dfs.safemode.threshold.pct]
  is set to 1 then all blocks of all files should be minimally replicated.
@@ -189, +189 @@
  order to replicate them the name-node needs to leave safe mode.
  Learn more about safe mode
- [http://lucene.apache.org/hadoop/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction) here].
+ [http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/dfs/NameNode.html#setSafeMode(org.apache.hadoop.dfs.FSConstants.SafeModeAction) here].

  [[BR]]
  [[Anchor(13)]]
  '''13. [#13 MR. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that?]'''

- Use the configuration knob [http://lucene.apache.org/hadoop/hadoop-default.html#mapred.tasktracker.tasks.maximum mapred.tasktracker.tasks.maximum] to control the number of maps/reduces spawned simultaneously on a !TaskTracker. By default it is set to ''2'', hence one sees a maximum of 2 maps and 2 reduces at a given instant on a !TaskTracker.
+ Use the configuration knob [http://hadoop.apache.org/core/docs/current/hadoop-default.html#mapred.tasktracker.tasks.maximum mapred.tasktracker.tasks.maximum] to control the number of maps/reduces spawned simultaneously on a !TaskTracker. By default it is set to ''2'', hence one sees a maximum of 2 maps and 2 reduces at a given instant on a !TaskTracker.

  Caveats:
   * ''mapred.tasktracker.tasks.maximum'' is a cluster-wide limit, i.e. controlled at the !JobTracker end. [http://issues.apache.org/jira/browse/HADOOP-1245 HADOOP-1245] should fix that.
@@ -232, +232 @@
  ''Data-nodes'' can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the configuration parameter
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.data.dir dfs.data.dir].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.data.dir dfs.data.dir].
  Data-nodes will attempt to place an equal amount of data in each of the directories.

  The ''name-node'' also supports multiple directories, which in this case store the name space image and the edits log. The directories are specified via the
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.name.dir dfs.name.dir]
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.name.dir dfs.name.dir]
  configuration parameter. The name-node directories are used for the name space data replication so that the image and the log can be restored from the remaining volumes if one of them fails.
@@ -264, +264 @@
  Hadoop offers the ''decommission'' feature to retire a set of existing data-nodes. The nodes to be retired should be included in the ''exclude file'', and the exclude file name should be specified as the configuration parameter
- [http://lucene.apache.org/hadoop/hadoop-default.html#dfs.hosts.exclude dfs.hosts.exclude].
+ [http://hadoop.apache.org/core/docs/current/hadoop-default.html#dfs.hosts.exclude dfs.hosts.exclude].
  Then the shell command
  {{{
  bin/hadoop dfsadmin -refreshNodes