hadoop-common-dev mailing list archives

From "Doug Cutting (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2046) Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
Date Wed, 17 Oct 2007 20:30:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535725 ]

Doug Cutting commented on HADOOP-2046:

Overall this looks great.  A few comments:

- In Configuration.java, the first use of 'final' should be in italics, not bold, and the
anchors in the headers should be done with <h4 id=foo>Foo</h4>.  I also find the
links to String and Path mostly just introduce noise.  We might make the first reference to
Path a link, but leave the rest as plain text: no one is going to click on that link to find
out what a Java String is, nor do we need more than a single link to Path.
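
As a rough sketch of the markup suggested above (not text from the patch), a class-level Javadoc comment might look like the following. The class name AnchorExample, the id "FinalParams", the heading text and the linkTo helper are all hypothetical; the {@link} to org.apache.hadoop.fs.Path only resolves when Hadoop is on the javadoc classpath.

```java
/**
 * Hypothetical class, used only to illustrate the Javadoc markup discussed above.
 *
 * <h4 id="FinalParams">Final Parameters</h4>
 *
 * <p>The first mention of <i>final</i> is set in italics rather than bold.
 * Only the first reference to {@link org.apache.hadoop.fs.Path Path} is a link;
 * later mentions of Path, and every mention of String, stay plain text.</p>
 */
public final class AnchorExample {

    /** Fragment identifier matching the id attribute on the header above (assumed name). */
    public static final String FINAL_PARAMS_ANCHOR = "FinalParams";

    private AnchorExample() {
        // Not meant to be instantiated; this class only carries documentation.
    }

    /** Builds a link target such as "AnchorExample.html#FinalParams". */
    public static String linkTo(String page, String anchor) {
        return page + "#" + anchor;
    }
}
```

Other pages could then point at the section with a fragment such as AnchorExample.html#FinalParams.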

- In JobClient.java, the anchors should be implemented with 'id='.  We should not mention
HDFS here: the system directory could be in, e.g., KFS.  I would also leave the internally
used file names "job.jar" and "job.xml" out of this description.  The list of things done
should include 'submission of the job to the jobtracker'.  The steps you list are all preparations
for that, but we don't want to forget that crucial step.  In the list of ways to handle job
sequencing, it should be made more clear that these are alternatives: one should choose just
one method.  Also, should we mention the jobcontrol stuff here?

- In JobConf.java: the JobConf isn't XML.  It can be serialized as XML, but it's fundamentally
a Map<String,String>, a Configuration.  We also have anchors that should use 'id=' here,
and mentions of HDFS that should instead just refer to FileSystem (all FileSystems have a
block size, which is used to generate splits).  And, instead of 'default InputFormat' we should
say 'standard file-based InputFormats'.  We should probably also include something at the
top level in this class about how the job jar file is determined.
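
To make the two points above concrete, here is a minimal sketch that uses only the JDK, not Hadoop's own classes. It treats a configuration as string key/value pairs that can be written to and read back from XML (java.util.Properties stands in for Configuration, and its XML layout differs from Hadoop's), and it cuts a file into block-sized splits. The class ConfSketch, its method names and the "example.*" keys are hypothetical, and the real file-based InputFormats may take further settings into account when sizing splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

public class ConfSketch {

    /** Serializes string key/value pairs as XML (JDK Properties format, not Hadoop's). */
    static byte[] toXml(Properties conf) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            conf.storeToXML(out, "illustrative job configuration", "UTF-8");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Reads the key/value pairs back; the XML is only a serialized form of the map. */
    static Properties fromXml(byte[] xml) {
        Properties conf = new Properties();
        try {
            conf.loadFromXML(new ByteArrayInputStream(xml));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return conf;
    }

    /** Cuts a file of fileLength bytes into {offset, length} pairs, one per block. */
    static List<long[]> splits(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<long[]> result = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            // The last split may be shorter than a full block.
            result.add(new long[] {offset, Math.min(blockSize, fileLength - offset)});
        }
        return result;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("example.job.name", "wordcount");   // hypothetical key
        conf.setProperty("example.block.size", "100");       // hypothetical key

        Properties restored = fromXml(toXml(conf));
        long blockSize = Long.parseLong(restored.getProperty("example.block.size"));
        for (long[] split : splits(250, blockSize)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Nothing in the split computation refers to a particular file system: it needs only a file length and a block size, which is why the text can talk about FileSystem in general rather than HDFS.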

> Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
> ----------------------------------------------------------------------------
>                 Key: HADOOP-2046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2046
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.14.2
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>            Priority: Critical
>             Fix For: 0.15.0
>         Attachments: HADOOP-2046_1_20071018.patch
> I'd like to put forward some thoughts on how to structure reasonably detailed documentation
for hadoop.
> Essentially I think of at least 3 different profiles to target:
> * hadoop-dev, folks who are actively involved improving/fixing hadoop.
> * hadoop-user
> ** mapred application writers and/or folks who directly use hdfs
> ** hadoop cluster administrators
> For this issue, I'd like to first target the latter category (admin and hdfs/mapred user),
which is, arguably, where we get the biggest bang for the buck right now.
> There is a crying need to get user-level stuff documented, judging by the sheer number of
emails we get on the hadoop lists...
> ----
> *1. Installing/Configuration Guides*
> This set of documents caters to folks ranging from someone just playing with hadoop on
a single-node to operations teams who administer hadoop on several nodes (thousands). To ensure
we cover all bases I'm thinking along the lines of:
> * _Download, install and configure hadoop_ on a single-node cluster: including a few
comments on how to run examples (word-count) etc.
> * *Admin Guide*: Install and configure a real, distributed cluster. 
> * *Tune Hadoop*: Separate sections on how to tune hdfs and map-reduce, targeting power
> I reckon most of this would be done via forrest, with appropriate links to javadoc.
> ----
> *2. User Manual*
> This set is geared for people who use hdfs and/or map-reduce per se. Stuff to document:
> * Write a really simple mapred application, just fitting the blocks together i.e. maybe
a walk-through of a couple of examples like word-count, sort etc.
> * Detailed information on important map-reduce user-interfaces:
> *- JobConf
> *- JobClient
> *- Tool & ToolRunner
> *- InputFormat 
> *-- InputSplit
> *-- RecordReader
> *- Mapper
> *- Reducer
> *- Reporter
> *- OutputCollector
> *- Writable
> *- WritableComparable
> *- OutputFormat
> *- DistributedCache
> * SequenceFile
> *- Compression types: NONE, RECORD, BLOCK
> * Hadoop Streaming
> * Hadoop Pipes
> I reckon most of this would land up in the javadocs, specifically package.html and some
via forrest.
> ----
> Also, as discussed in HADOOP-1881, it would be quite useful to maintain documentation
per-release, even on the hadoop website i.e. we could have a main documentation page link
to documentation per-release and to the trunk.
> ----
> Thoughts?

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
