From: "Runping Qi"
Reply-To: hadoop-dev@lucene.apache.org
References: <11894100.1175875352266.JavaMail.jira@brutus>
Subject: RE: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner
Date: Tue, 10 Apr 2007 13:58:40 -0700
Message-ID: <001d01c77bb3$045796c0$174d480a@ds.corp.yahoo.com>

Hi Arkady,

With my changes, which should be available soon, the user can specify the following:

1. Mapper (a Java mapper class or an executable)
2. Reducer (a Java reducer class or an executable). Reduce NONE will be introduced as per HADOOP-1216.
3. InputFormat class
4. OutputFormat class
5. Partitioner

I don't understand what you mean by (input partitioner, splitter for reduce, sorter for reduce). Can you explain?

Hadoop has a collection of built-in classes:

IdentityMapper, IdentityReducer, RegexMapper, TokenCountMapper, LongSumReducer
TextInputFormat, SequenceFileInputFormat, TextOutputFormat, SequenceFileOutputFormat, NullOutputFormat

Some more are coming soon: SequenceFileToLineInputFormat, KeyValueTextInputFormat.

We can add IdentityMapper/IdentityReducer/KeyValueTextInputFormat/TextOutputFormat as the defaults for Hadoop Streaming.

Runping

> -----Original Message-----
> From: Arkady Borkovsky [mailto:arkady@yahoo-inc.com]
> Sent: Tuesday, April 10, 2007 1:24 PM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: [jira] Created: (HADOOP-1215) Streaming should allow to specify a partitioner
>
> To extend this,
> I'd suggest that Hadoop Streaming is interfaced in the following way:
>
> Map reduce process is parameterized by several algorithms.
> This includes at least
> 1. mapper
> 2. reducer (including special case of NONE)
> 3. input format
> 4. input partitioner
> 5. splitter for reduce
> 6. sorter for reduce
>
> The current Hadoop Streaming allows specifying only 1 and 2 (and gives limited control over 3).
> Nicely, 1 (the mapper) can be specified either as a command to stream the data through or as a Java class to use.
>
> It would make a lot of sense to
> (a) allow specifying a Java class that implements each of these
> (b) provide meaningful defaults, so that the user of Hadoop Streaming does not need to worry about details irrelevant to her specific task.
> (c) provide a set of useful classes so that the user can pick the necessary ones rather than re-implementing the same things again and again.
> (c.1) make sure that there is a convenient shorthand to specify these predefined classes (e.g. without the long package prefix)
>
> In particular, it would be good to have predefined Identity mapper and reducer (the mapper actually is available now), reducers that provide simple aggregation (like in Abacus), input formats for commonly used formats (including CSV, flat XML, etc.), a sorter different from the splitter, etc.
>
> Then "Streaming should allow to specify a partitioner" would be automatically resolved as a special case.
> It might be better to implement the whole consistent approach rather than do special cases one by one.
>
> -- ab
>
>
> On Apr 6, 2007, at 9:02 AM, Runping Qi (JIRA) wrote:
>
> > Streaming should allow to specify a partitioner
> > -----------------------------------------------
> >
> > Key: HADOOP-1215
> > URL: https://issues.apache.org/jira/browse/HADOOP-1215
> > Project: Hadoop
> > Issue Type: Improvement
> > Reporter: Runping Qi
> >
> > --
> > This message is automatically generated by JIRA.
> > -
> > You can reply to this email to add a comment to the issue online.
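
As an illustration of item 5 (Partitioner) in the reply above, here is a minimal Java sketch of what a pluggable partitioner decides: which reducer receives each tab-separated key/value line. The `PartitionerSketch` class, the `LinePartitioner` interface, and the dot-delimited key prefix are hypothetical names and conventions chosen for this example; they are not the Hadoop API and not the HADOOP-1215 patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionerSketch {

    /** Hypothetical stand-in for a partitioner interface; NOT the Hadoop API. */
    interface LinePartitioner {
        int getPartition(String key, int numPartitions);
    }

    /** Hashes the whole key; the mask keeps the result non-negative. */
    static final LinePartitioner HASH =
        (key, n) -> (key.hashCode() & Integer.MAX_VALUE) % n;

    /**
     * Hashes only the part of the key before the first '.', so that
     * "a.1" and "a.2" are routed to the same reducer (assumed key layout).
     */
    static final LinePartitioner PREFIX = (key, n) -> {
        int dot = key.indexOf('.');
        String prefix = dot < 0 ? key : key.substring(0, dot);
        return (prefix.hashCode() & Integer.MAX_VALUE) % n;
    };

    /** Routes "key<TAB>value" lines into one bucket per reducer. */
    static List<List<String>> route(List<String> lines, LinePartitioner p, int n) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String line : lines) {
            int tab = line.indexOf('\t');
            // A line without a tab is treated as a key with an empty value.
            String key = tab < 0 ? line : line.substring(0, tab);
            buckets.get(p.getPartition(key, n)).add(line);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a.1\tx", "a.2\ty", "b.1\tz");
        System.out.println(route(lines, PREFIX, 4));
    }
}
```

With the `PREFIX` partitioner, the lines keyed `a.1` and `a.2` end up in the same bucket, whereas `HASH` is free to separate them. That difference is the kind of control a user-specified partitioner would give a streaming job.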