hadoop-common-dev mailing list archives

From "Runping Qi" <runp...@yahoo-inc.com>
Subject RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner
Date Tue, 10 Apr 2007 20:58:40 GMT

Hi Arkady,

With my changes, which should be available soon, the user can specify the
following:

1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be
introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner
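
To make item 5 concrete, here is a minimal sketch of what a partitioner
decides: given a key and the number of reduce tasks, it picks which reduce
task receives the record. This is plain Java and does not use the Hadoop
API; the class name, the method names, and the "prefix before the first
dot" rule are all hypothetical, chosen only for illustration.

```java
// Illustrative sketch only: not the Hadoop Partitioner interface.
class PartitionSketch {

    // Hash partitioning: clear the sign bit so the result is never negative,
    // then take the remainder modulo the number of reduce tasks.
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical custom rule: partition only on the part of the key before
    // the first '.', so that keys sharing a prefix reach the same reduce task.
    static int prefixPartition(String key, int numReduceTasks) {
        int dot = key.indexOf('.');
        String prefix = (dot < 0) ? key : key.substring(0, dot);
        return hashPartition(prefix, numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 4;
        String[] keys = { "user1.click", "user1.view", "user2.click" };
        for (String key : keys) {
            System.out.println(key
                + " hash=" + hashPartition(key, reducers)
                + " prefix=" + prefixPartition(key, reducers));
        }
    }
}
```

With the prefix rule, "user1.click" and "user1.view" always land in the same
partition, which is the kind of control a user-specified partitioner would
give a streaming job.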

I don't understand what you mean by (input partitioner, splitter for
reduce, sorter for reduce). Can you explain?

Hadoop has a collection of built-in classes:

IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
LongSumReducer

TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
SequenceFileOutputFormat, NullOutputFormat

Some more coming soon:

SequenceFileToLineInputFormat, KeyValueTextInputFormat.

We can add IdentityMapper/IdentityReducer/
KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop
Streaming.
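
As a rough sketch of what those defaults would amount to for one record:
the input line is split into a key and a value, passed through unchanged
(identity map and reduce), and written back as text. This is plain Java,
not the Hadoop classes themselves; the class and method names are
hypothetical, and splitting at the first tab is an assumption about how
the key/value text format behaves.

```java
// Illustrative sketch only: mimics the proposed default pipeline for one line.
class StreamingDefaultsSketch {

    // Assumed input rule: the key is everything before the first tab and the
    // value is the rest; a line without a tab becomes a key with an empty value.
    static String[] splitKeyValue(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    // Identity step: the record is returned unchanged.
    static String[] identity(String[] keyValue) {
        return keyValue;
    }

    // Assumed output rule: key and value joined by a single tab.
    static String formatRecord(String[] keyValue) {
        return keyValue[0] + "\t" + keyValue[1];
    }

    public static void main(String[] args) {
        String line = "apple\t3 red";
        String out = formatRecord(identity(splitKeyValue(line)));
        System.out.println(out.equals(line)); // true: the record round-trips
    }
}
```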


Runping




> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Tuesday, April 10, 2007 1:24 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
> 
> To extend this,
> I'd suggest that Hadoop Streaming is interfaced in the following way:
> 
> Map reduce process is parameterized by several algorithms.
> This includes at least
> 1. mapper
> 2. reducer  (including special case of NONE)
> 3. input format
> 4. input partitioner
> 5. splitter for reduce
> 6. sorter for reduce
> 
> The current Hadoop Streaming allows to specify only the 1 and 2 (and
> gives a limited control on 3)
> Nicely, the 1 (mapper) can be specified both as a command to stream the
> data through, or a Java class to use.
> 
> It would make a lot of sense to
> (a) allow to specify a Java class that implements each of these
> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> does not need to worry about details irrelevant for her specific task.
> (c) provide a set of useful classes so that the user can pick the
> necessary ones rather than re-implementing same things again and again.
> (c.1) make sure that there is a convenient short-hand to specify these
> predefined classes (e.g. without long package prefix)
> 
> In particular, it would be good to have predefined Identity mapper and
> reducer (the mapper actually is available now), reducers that provide
> simple aggregation (like in Abacus), input formats for commonly used
> formats (including CSV, flat XML, etc), sorter different from splitter,
> etc.
> 
> Then "Streaming should allow to specify a partitioner" would be
> automatically resolved as a special case.
> It might be better to implement the whole consistent approach rather
> than do special cases one by one.
> 
> -- ab
> 
> 
> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
> 
> > Streaming should allow to specify a partitioner
> > -----------------------------------------------
> >
> >                  Key: HADOOP-1215
> >                  URL: https://issues.apache.org/jira/browse/HADOOP-1215
> >              Project: Hadoop
> >           Issue Type: Improvement
> >             Reporter: Runping Qi
> >
> >
> >
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
> >


