From: "Benjamin Reed (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Mon, 28 Aug 2006 08:39:24 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-372) should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
Message-ID: <32829272.1156779564337.JavaMail.jira@brutus>
In-Reply-To: <23897633.1153332195926.JavaMail.jira@brutus>

    [ http://issues.apache.org/jira/browse/HADOOP-372?page=comments#action_12430992 ]

Benjamin Reed commented on HADOOP-372:
--------------------------------------

We have a desperate need to be able to specify different inputformat classes, mappers, and partition functions in the same job. Our need roughly corresponds to this issue. However, we do not assume that a given input will need to be processed by only one inputformat, mapper, or partition function.

I think Owen was right about directing the discussion towards HADOOP-451. If HADOOP-451 were fixed, we (meaning my project) would not have any issues in this area.

We actually use a syntax similar to the one proposed by Michel:

    job.set("DispatchInputFormat.inputdirmap", "foo=org.example.FooInput bar=org.example.BarInput")

But imagine that we get:

    job.set("DispatchInputFormat.inputdirmap", "foo=org.example.FooInput foo=org.example.BarInput")

Now foo must be processed by both FooInput and BarInput. When the splits are created and sent to the mappers, the DispatchInputFormat will get {"foo", offset, length}, but it has no way of knowing whether to apply FooInput or BarInput. With HADOOP-451 fixed, we could encode the InputFormat to use in the split.

> should allow to specify different inputformat classes for different input dirs for Map/Reduce jobs
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-372
>                 URL: http://issues.apache.org/jira/browse/HADOOP-372
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.4.0
>         Environment: all
>            Reporter: Runping Qi
>         Assigned To: Owen O'Malley
>
> Right now, the user can specify multiple input directories for a map reduce job.
> However, the files under all the directories are assumed to be in the same format,
> with the same key/value classes. This proves to be a serious limit in many situations.
> Here is an example.
> Suppose I have three simple tables:
> one has URLs and their rank values (page ranks),
> another has URLs and their classification values,
> and the third one has the URL meta data such as crawl status, last crawl time, etc.
> Suppose now I need a job to generate a list of URLs to be crawled next.
> The decision depends on the info in all the three tables.
> Right now, there is no easy way to accomplish this.
> However, this job can be done if the framework allows to specify different inputformats for different input dirs.
> Suppose my three tables are in the following directories respectively: rankTable, classificationTable, and metaDataTable.
> If we extend JobConf class with the following method (as Owen suggested to me):
>     addInputPath(aPath, anInputFormatClass, anInputKeyClass, anInputValueClass)
> Then I can specify my job as follows:
>     addInputPath(rankTable, SequenceFileInputFormat.class, UTF8.class, DoubleWritable.class)
>     addInputPath(classificationTable, TextInputFormat.class, UTF8.class, UTF8.class)
>     addInputPath(metaDataTable, SequenceFileInputFormat.class, UTF8.class, MyRecord.class)
> If an input directory is added through the current API, it will have the same meaning as it is now.
> Thus this extension will not affect any applications that do not need this new feature.
> It is relatively easy for the M/R framework to create an appropriate record reader for a map task based on the above information.
> And that is the only change needed for supporting this extension.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
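
The ambiguity raised in the comment on DispatchInputFormat.inputdirmap can be illustrated with a small plain-Java sketch. It does not use any Hadoop classes, and every name in it (DispatchSketch, parseDirMap, TaggedSplit, makeSplits) is hypothetical, invented only for illustration. It assumes the mapping string is a whitespace-separated list of dir=class entries, as in the job.set(...) calls quoted in the comment, and it shows one possible reading of "encode the InputFormat to use in the split": each split carries the name of the format class that should read it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names exist in Hadoop.
class DispatchSketch {

    // Parse "dir=FormatClass dir=FormatClass ..." into dir -> list of format class names.
    // A directory may appear more than once, so the value is a list, not a single name.
    static Map<String, List<String>> parseDirMap(String spec) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String entry : spec.trim().split("\\s+")) {
            if (entry.isEmpty()) {
                continue; // empty spec
            }
            int eq = entry.indexOf('=');
            if (eq <= 0 || eq == entry.length() - 1) {
                throw new IllegalArgumentException("expected dir=class, got: " + entry);
            }
            String dir = entry.substring(0, eq);
            String format = entry.substring(eq + 1);
            result.computeIfAbsent(dir, k -> new ArrayList<>()).add(format);
        }
        return result;
    }

    // A split that carries the format class name alongside {path, offset, length},
    // so the reader side does not have to guess which format applies.
    static final class TaggedSplit {
        final String path;
        final long offset;
        final long length;
        final String formatClass;

        TaggedSplit(String path, long offset, long length, String formatClass) {
            this.path = path;
            this.offset = offset;
            this.length = length;
            this.formatClass = formatClass;
        }
    }

    // Emit one split per (directory, format) pair. Real split computation would also
    // divide each input by size; here each input is a single split of the given length.
    static List<TaggedSplit> makeSplits(Map<String, List<String>> dirMap, long length) {
        List<TaggedSplit> splits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : dirMap.entrySet()) {
            for (String format : e.getValue()) {
                splits.add(new TaggedSplit(e.getKey(), 0L, length, format));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        Map<String, List<String>> dirMap =
            parseDirMap("foo=org.example.FooInput foo=org.example.BarInput");
        for (TaggedSplit s : makeSplits(dirMap, 1024L)) {
            System.out.println(s.path + " [" + s.offset + ", " + s.length + ") -> " + s.formatClass);
        }
    }
}
```

Without the formatClass field, the two splits produced for foo would be indistinguishable on the mapper side, which is the problem the comment describes.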