hadoop-common-dev mailing list archives

From "Owen O'Malley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-2046) Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
Date Thu, 18 Oct 2007 21:20:50 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536054 ]

Owen O'Malley commented on HADOOP-2046:
---------------------------------------

I agree that this is good overall. More items:
* In Configuration, the proper way to get un-substituted values is getRaw, not the deprecated getObject (a short illustration follows after the quoted message below).
* I'd add a better discussion of the set/getOutputValueGroupingComparator. Something like
my message to hadoop-user on the topic:
{quote}
There is no guarantee that the reduce sort is stable in any sense. (With the non-deterministic order in which map outputs become available to the reduce, that wouldn't make much sense.)

There certainly isn't enough documentation about what is allowed for sorting. I've filed HADOOP-1981 to expand the Reducer javadoc to mention the JobConf methods that can control the sort order. In particular, the methods are:

setOutputKeyComparatorClass
setOutputValueGroupingComparator

The first comparator controls the sort order of the keys. The second controls which keys are
grouped together into a single call to the reduce method. The combination of these two allows
you to set up jobs that act like you've defined an order on the values.

For example, say that you want to find duplicate web pages and tag them all with the url of
the "best" known example. You would set up the job like:

Map Input Key: url
Map Input Value: document
Map Output Key: document checksum, url pagerank
Map Output Value: url
Partitioner: by checksum
OutputKeyComparator: by checksum and then decreasing pagerank
OutputValueGroupingComparator: by checksum

With this setup, the reduce function will be called exactly once per checksum, and the first value from the iterator will be the one with the highest pagerank, which can then be used to tag the other entries in the checksum family.
{quote}
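
To make that setup concrete, here is a rough sketch of the wiring, written against the org.apache.hadoop.mapred API as it looks in later releases (0.15-era signatures differ slightly). The class names are illustrative, and for brevity the composite map-output key is a Text of the form "checksum\tpagerank" rather than a proper custom Writable:

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class DedupJobSetup {

  /** Partition on the checksum part only, so every duplicate of a page
   *  lands in the same reduce. */
  public static class ChecksumPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String checksum = key.toString().split("\t", 2)[0];
      return (checksum.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  /** Sort keys by checksum, then by *decreasing* pagerank. */
  public static class KeyComparator extends WritableComparator {
    public KeyComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String[] ka = a.toString().split("\t", 2);
      String[] kb = b.toString().split("\t", 2);
      int cmp = ka[0].compareTo(kb[0]);
      if (cmp != 0) return cmp;
      // reversed arguments so the highest pagerank sorts first
      return Double.compare(Double.parseDouble(kb[1]),
                            Double.parseDouble(ka[1]));
    }
  }

  /** Group on checksum alone, so each checksum is exactly one reduce call
   *  and the highest-pagerank url arrives first in the value iterator. */
  public static class GroupComparator extends WritableComparator {
    public GroupComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      return a.toString().split("\t", 2)[0]
              .compareTo(b.toString().split("\t", 2)[0]);
    }
  }

  public static void configure(JobConf job) {
    job.setMapOutputKeyClass(Text.class);    // "checksum\tpagerank"
    job.setMapOutputValueClass(Text.class);  // the url
    job.setPartitionerClass(ChecksumPartitioner.class);
    job.setOutputKeyComparatorClass(KeyComparator.class);
    job.setOutputValueGroupingComparator(GroupComparator.class);
  }
}
{code}

A real job would also implement the raw, byte-level compare in the comparators for speed; the object-based compare above just keeps the sketch short.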

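Back to the getRaw point above, a quick illustration of substituted versus raw values (property names made up):

{code:java}
import org.apache.hadoop.conf.Configuration;

public class RawConfDemo {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("base.dir", "/data");
    conf.set("log.dir", "${base.dir}/logs");

    // get() expands ${...} variable references; getRaw() does not.
    System.out.println(conf.get("log.dir"));     // /data/logs
    System.out.println(conf.getRaw("log.dir"));  // ${base.dir}/logs
  }
}
{code}
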
> Documentation: Hadoop Install/Configuration Guide and Map-Reduce User Manual
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-2046
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2046
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: documentation
>    Affects Versions: 0.14.2
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-2046_1_20071018.patch
>
>
> I'd like to put forward some thoughts on how to structure reasonably detailed documentation for hadoop.
> Essentially I think of at least 3 different profiles to target:
> * hadoop-dev, folks who are actively involved improving/fixing hadoop.
> * hadoop-user
> ** mapred application writers and/or folks who directly use hdfs
> ** hadoop cluster administrators
> For this issue, I'd like to first target the latter category (admins and hdfs/mapred users), which, arguably, offers the biggest bang for the buck right now.
> There is a crying need to get user-level stuff documented, judging by the sheer number of emails we get on the hadoop lists...
> ----
> *1. Installing/Configuration Guides*
> This set of documents caters to folks ranging from someone just playing with hadoop on a single node to operations teams who administer hadoop on several nodes (thousands). To ensure we cover all bases I'm thinking along the lines of:
> * _Download, install and configure hadoop_ on a single-node cluster: including a few comments on how to run examples (word-count) etc.
> * *Admin Guide*: Install and configure a real, distributed cluster. 
> * *Tune Hadoop*: Separate sections on how to tune hdfs and map-reduce, targeting power admins/users.
> I reckon most of this would be done via forrest, with appropriate links to javadoc.
> ----
> *2. User Manual*
> This set is geared for people who use hdfs and/or map-reduce per se. Stuff to document:
> * Write a really simple mapred application, just fitting the blocks together, i.e. maybe a walk-through of a couple of examples like word-count, sort etc.
> * Detailed information on important map-reduce user-interfaces:
> *- JobConf
> *- JobClient
> *- Tool & ToolRunner
> *- InputFormat 
> *-- InputSplit
> *-- RecordReader
> *- Mapper
> *- Reducer
> *- Reporter
> *- OutputCollector
> *- Writable
> *- WritableComparable
> *- OutputFormat
> *- DistributedCache
> * SequenceFile
> *- Compression types: NONE, RECORD, BLOCK
> * Hadoop Streaming
> * Hadoop Pipes
> I reckon most of this would end up in the javadocs, specifically package.html, and some via forrest.
> ----
> Also, as discussed in HADOOP-1881, it would be quite useful to maintain documentation per-release, even on the hadoop website, i.e. we could have a main documentation page linking to the documentation for each release and for trunk.
> ----
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

