Message-ID: <14831272.1202422389795.JavaMail.jira@brutus>
Date: Thu, 7 Feb 2008 14:13:09 -0800 (PST)
From: "Chris Douglas (JIRA)"
To: core-dev@hadoop.apache.org
Reply-To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat

[ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12566819#action_12566819 ]

Chris Douglas commented on HADOOP-449:
--------------------------------------

bq. 
I think people will understand that nesting FilterInputFormat cannot be done with the current API

True. The case I had in mind wasn't so much nesting as it was simultaneous instantiation. For example, the classes in mapred.join wouldn't hesitate to accept multiple FilterInputFormats for multiple data sources, but, as you point out, the API is such that one would quickly realize that chained filters aren't possible without some effort. I was hoping that these classes could be integrated into the aforementioned framework, and I'm confident they can be.

On that note, the MD5PercentFilter guards access to the MessageDigest within the instance, but multiple instances could corrupt it. It would probably be better if it were not static.

Also, since a filter may discard the vast majority of the input, is it necessary to update the reporter to avoid a timeout? A call to next may churn through data for some time, and I'm uncertain whether one can expect the base InputFormat to keep the task alive. I'd expect it to be fine for the majority of cases, but if you felt like being paranoid it's not insane.

bq. If you see a better solution to pass the Writables to the tasks, I will be very glad to adopt it. Or should we add setWritable() getWritable() to the Configuration?

I don't, sorry. :) I remember the JIRA you mention, the rejection of get/setWritable, and the reasoning probably remains sound. Other than the solutions you propose, the only other way I can think of would be to have an auxiliary InputFormat/input dir that slurps a set of keys (no splits!) into an in-memory Set and assume that the OOM exceptions are a strong hint to the user. Gross. Still, you could probably restrict these to Text, as long as the user is aware of SequenceFileAsTextInputFormat and related options. Automatically converting to String could produce some weird results if one isn't aware of how the filter is effected. Forcing someone to figure out how to get their WritableComparables to Text is ample warning, I think.

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>   boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>   mapred.input.filter.source = the underlying input format
>   mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value) to the RecordReader interface, but that will be addressed in a different issue.
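
For illustration only, here is a minimal sketch of two points raised in the comment above: keeping the MessageDigest per instance rather than static, and signalling progress for every record scanned, including the ones a filter discards. It is not the code from the attached patch and does not use the Hadoop API. The names KeyFilter, Md5PercentFilter, Progress and filter() are hypothetical stand-ins (KeyFilter plays the role of the proposed WritableFilter, restricted to String keys; Progress plays the role of the task's reporter), and the bucketing scheme is an assumption, not necessarily what the patch does.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class FilterSketch {

    /** Hypothetical stand-in for the proposed WritableFilter, keyed on String. */
    interface KeyFilter {
        boolean accept(String key);
    }

    /** Hypothetical stand-in for the task's progress reporter. */
    interface Progress {
        void tick();
    }

    /** Keeps roughly `percent` percent of keys, chosen by the MD5 of the key. */
    static final class Md5PercentFilter implements KeyFilter {
        private final int percent;
        // One digest per instance: MessageDigest is stateful and not thread-safe,
        // so a static digest could be corrupted by concurrent instances.
        private final MessageDigest digest;

        Md5PercentFilter(int percent) {
            if (percent < 0 || percent > 100) {
                throw new IllegalArgumentException("percent must be in [0, 100]");
            }
            this.percent = percent;
            try {
                this.digest = MessageDigest.getInstance("MD5");
            } catch (NoSuchAlgorithmException e) {
                throw new IllegalStateException("MD5 not available", e);
            }
        }

        @Override
        public synchronized boolean accept(String key) {
            // digest(byte[]) computes the hash and resets the digest for reuse.
            byte[] hash = digest.digest(key.getBytes(StandardCharsets.UTF_8));
            int h = ((hash[0] & 0xff) << 24) | ((hash[1] & 0xff) << 16)
                  | ((hash[2] & 0xff) << 8) | (hash[3] & 0xff);
            // Map the hash to a bucket in 0..99 and keep the lowest `percent` buckets.
            return Math.floorMod(h, 100) < percent;
        }
    }

    /** Returns the keys accepted by every filter, ticking progress per record scanned. */
    static List<String> filter(Iterator<String> source, List<KeyFilter> filters,
                               Progress progress) {
        List<String> kept = new ArrayList<>();
        while (source.hasNext()) {
            String key = source.next();
            progress.tick(); // report liveness even if the record is discarded
            boolean accepted = true;
            for (KeyFilter f : filters) {
                if (!f.accept(key)) {
                    accepted = false;
                    break;
                }
            }
            if (accepted) {
                kept.add(key);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alpha", "beta", "gamma", "delta");
        List<KeyFilter> filters = new ArrayList<>();
        filters.add(new Md5PercentFilter(50));
        int[] ticks = {0};
        List<String> kept = filter(keys.iterator(), filters, () -> ticks[0]++);
        System.out.println("kept " + kept + " after scanning " + ticks[0] + " records");
    }
}
```

Because the digest is a field of each filter, two filters constructed for two data sources cannot interfere with one another, and the tick on every scanned record means a filter that rejects nearly everything still reports activity. Whether the real reporter needs to be updated this often is the open question from the comment; the sketch only shows where such a call would sit.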