hadoop-common-dev mailing list archives

From Arkady Borkovsky <ark...@yahoo-inc.com>
Subject Re: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner
Date Tue, 10 Apr 2007 20:24:20 GMT
To extend this,
I'd suggest that the Hadoop Streaming interface be structured as follows:

A map-reduce process is parameterized by several algorithms.
These include at least:
1. mapper
2. reducer  (including special case of NONE)
3. input format
4. input partitioner
5. splitter for reduce
6. sorter for reduce

The current Hadoop Streaming lets the user specify only 1 and 2 (and 
gives limited control over 3).
Nicely, 1 (the mapper) can be specified either as a command to stream 
the data through or as a Java class to use.

It would make a lot of sense to
(a) allow the user to specify a Java class that implements each of these
(b) provide meaningful defaults, so that the user of Hadoop Streaming 
does not need to worry about details irrelevant to her specific task.
(c) provide a set of useful classes so that the user can pick the 
necessary ones rather than re-implementing the same things again and again.
(c.1) make sure that there is a convenient short-hand to specify these 
predefined classes (e.g. without long package prefix)
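
A minimal sketch of what the short-hand in (c.1) could look like. The 
package prefix, the alias names, and the class ShortNames are all 
hypothetical and chosen for illustration only; they are not part of 
Hadoop Streaming.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class ShortNames {
    // Hypothetical package that would hold the predefined classes.
    private static final String DEFAULT_PACKAGE = "org.example.streaming.lib.";

    // Hypothetical aliases for a few predefined classes.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("identity", DEFAULT_PACKAGE + "IdentityMapper");
        ALIASES.put("none", DEFAULT_PACKAGE + "NoReducer");
        ALIASES.put("csv", DEFAULT_PACKAGE + "CsvInputFormat");
    }

    /** Expands a short name to a fully qualified class name. */
    public static String resolve(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("class name must not be empty");
        }
        // 1. A known alias wins.
        String alias = ALIASES.get(name.toLowerCase(Locale.ROOT));
        if (alias != null) {
            return alias;
        }
        // 2. A name that already contains a package is left untouched.
        if (name.indexOf('.') >= 0) {
            return name;
        }
        // 3. A bare class name is looked up in the default package.
        return DEFAULT_PACKAGE + name;
    }

    public static void main(String[] args) {
        System.out.println(resolve("identity"));
        System.out.println(resolve("KeyFieldPartitioner"));
        System.out.println(resolve("com.acme.MyPartitioner"));
    }
}
```

With something like this, a user-supplied fully qualified class and a 
predefined one go through the same option, and only the predefined 
ones get the short spelling.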

In particular, it would be good to have a predefined Identity mapper and 
reducer (the mapper is actually available now), reducers that provide 
simple aggregation (like in Abacus), input formats for commonly used 
formats (including CSV, flat XML, etc.), and a sorter different from the 
splitter.

Then "Streaming should allow to specify a partitioner" would be 
automatically resolved as a special case.
It might be better to implement the whole consistent approach rather 
than do special cases one by one.
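
To illustrate the "special case" point, here is a rough sketch in which 
every one of the six algorithms is a slot with a default, and setting a 
partitioner is just filling one slot. The class StreamJobSpec, the slot 
names, and the "default" placeholder are hypothetical; it also assumes 
that the "splitter for reduce" above is what Hadoop calls a partitioner.

```java
import java.util.EnumMap;
import java.util.Map;

public class StreamJobSpec {
    /** The six pluggable algorithms from the list above. */
    public enum Slot {
        MAPPER, REDUCER, INPUT_FORMAT, INPUT_PARTITIONER,
        REDUCE_SPLITTER, REDUCE_SORTER
    }

    // Placeholder meaning "use whatever the framework picks by default".
    public static final String DEFAULT = "default";

    private final Map<Slot, String> components = new EnumMap<>(Slot.class);

    public StreamJobSpec() {
        // Every slot starts with a default, so the user only
        // overrides what matters for the task at hand.
        for (Slot slot : Slot.values()) {
            components.put(slot, DEFAULT);
        }
    }

    /** Overrides one slot with a command or a Java class name. */
    public StreamJobSpec with(Slot slot, String implementation) {
        if (implementation == null || implementation.isEmpty()) {
            throw new IllegalArgumentException("implementation must not be empty");
        }
        components.put(slot, implementation);
        return this;
    }

    public String get(Slot slot) {
        return components.get(slot);
    }

    public static void main(String[] args) {
        StreamJobSpec spec = new StreamJobSpec()
            .with(Slot.MAPPER, "/bin/cat")
            // Specifying a partitioner needs no dedicated code path.
            .with(Slot.REDUCE_SPLITTER, "com.example.MyPartitioner");
        for (Slot slot : Slot.values()) {
            System.out.println(slot + " = " + spec.get(slot));
        }
    }
}
```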

-- ab

On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:

> Streaming should allow to specify a partitioner
> -----------------------------------------------
>                  Key: HADOOP-1215
>                  URL: https://issues.apache.org/jira/browse/HADOOP-1215
>              Project: Hadoop
>           Issue Type: Improvement
>             Reporter: Runping Qi
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
