Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
From: "Runping Qi" <runping@yahoo-inc.com>
To: <hadoop-dev@lucene.apache.org>
References: <11894100.1175875352266.JavaMail.jira@brutus>
 <e53e73289bb3e8fe68c8e4e5f470b4aa@yahoo-inc.com>
Subject: RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a
 partitioner
Date: Tue, 10 Apr 2007 13:58:40 -0700
Message-ID: <001d01c77bb3$045796c0$174d480a@ds.corp.yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: 7bit
In-Reply-To: <e53e73289bb3e8fe68c8e4e5f470b4aa@yahoo-inc.com>
Thread-Index: Acd7rlVchfjxdbl/QRiuYPNFbIbRXwAAhOug


Hi Arkady,

With my changes, which should be available soon, the user will be able to
specify the following:

1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be
introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner
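
For context on item 5: a partitioner decides which reduce task receives
each map output key. The following is a minimal sketch of hash-based
partitioning, not the HADOOP-1215 patch itself. It is self-contained and
does not use the Hadoop API; the class and method names
(HashPartitionSketch, partitionFor) are hypothetical and chosen only for
illustration.

```java
// Hypothetical, self-contained sketch; it does not depend on Hadoop.
final class HashPartitionSketch {

    // Maps a key to one of numPartitions reduce tasks.
    static int partitionFor(Object key, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // Clear the sign bit so that negative hash codes still yield
        // a non-negative partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "apple"};
        for (String key : keys) {
            // Equal keys always land on the same reduce task.
            System.out.println(key + " -> reduce task " + partitionFor(key, 4));
        }
    }
}
```

A user-specified partitioner would replace the hash with task-specific
logic, for example sending all keys that share a prefix to the same reduce
task.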

I don't understand what you mean by (input partitioner, splitter for
reduce, sorter for reduce). Can you explain?

Hadoop has a collection of built-in classes:

IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper,
LongSumReducer

TextInputFormat, SequenceFileInputFormat, TextOutputFormat,
SequenceFileOutputFormat, NullOutputFormat

Some more coming soon:

SequenceFileToLineInputFormat, KeyValueTextInputFormat.

We can make IdentityMapper, IdentityReducer, KeyValueTextInputFormat, and
TextOutputFormat the defaults for Hadoop Streaming.
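
One way such defaults could be applied is to start from the default class
names and let anything the user specifies override them. The sketch below
assumes that approach; it is not the actual Streaming implementation. The
class name StreamingDefaultsSketch, the method resolve, and the option keys
("mapper", "reducer", "inputformat", "outputformat") are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of default resolution; not part of Hadoop Streaming.
final class StreamingDefaultsSketch {

    // Returns the effective settings: the proposed defaults, overridden by
    // whatever the user specified explicitly.
    static Map<String, String> resolve(Map<String, String> userOptions) {
        Map<String, String> resolved = new LinkedHashMap<>();
        resolved.put("mapper", "IdentityMapper");
        resolved.put("reducer", "IdentityReducer");
        resolved.put("inputformat", "KeyValueTextInputFormat");
        resolved.put("outputformat", "TextOutputFormat");
        // User-supplied values replace the defaults for the same key.
        resolved.putAll(userOptions);
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("reducer", "LongSumReducer");
        // Only the reducer differs from the defaults here.
        System.out.println(resolve(user));
    }
}
```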


Runping


> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Tuesday, April 10, 2007 1:24 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to
> specify a partitioner
> 
> To extend this,
> I'd suggest that Hadoop Streaming is interfaced in the following way:
> 
> A map-reduce process is parameterized by several algorithms.
> These include at least:
> 1. mapper
> 2. reducer  (including special case of NONE)
> 3. input format
> 4. input partitioner
> 5. splitter for reduce
> 6. sorter for reduce
> 
> The current Hadoop Streaming allows specifying only 1 and 2 (and
> gives limited control over 3).
> Nicely, 1 (the mapper) can be specified either as a command to stream the
> data through or as a Java class to use.
> 
> It would make a lot of sense to
> (a) allow specifying a Java class that implements each of these
> (b) provide meaningful defaults, so that the user of Hadoop Streaming
> does not need to worry about details irrelevant to her specific task.
> (c) provide a set of useful classes so that the user can pick the
> necessary ones rather than re-implementing the same things again and again.
> (c.1) make sure that there is a convenient shorthand to specify these
> predefined classes (e.g. without a long package prefix)
> 
> In particular, it would be good to have a predefined Identity mapper and
> reducer (the mapper is actually available now), reducers that provide
> simple aggregation (like in Abacus), input formats for commonly used
> formats (including CSV, flat XML, etc.), a sorter different from the
> splitter, etc.
> 
> Then "Streaming should allow to specify a partitioner" would be
> automatically resolved as a special case.
> It might be better to implement the whole consistent approach rather
> than do special cases one by one.
> 
> -- ab
> 
> 
> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
> 
> > Streaming should allow to specify a partitioner
> > -----------------------------------------------
> >
> >                  Key: HADOOP-1215
> >                  URL: https://issues.apache.org/jira/browse/HADOOP-1215
> >              Project: Hadoop
> >           Issue Type: Improvement
> >             Reporter: Runping Qi