Message-ID: <5794374.1202746328572.JavaMail.jira@brutus>
Date: Mon, 11 Feb 2008 08:12:08 -0800 (PST)
From: "Enis Soztutar (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567679#action_12567679 ]

Enis Soztutar commented on HADOOP-449:
--------------------------------------

bq. 
The case I had in mind wasn't so much nesting as simultaneous instantiation. For example, the classes in mapred.join wouldn't hesitate to accept multiple FilterInputFormats for multiple data sources, but, as you point out, the API is such that one would quickly realize that chained filters aren't possible without some effort. I was hoping that these classes could be integrated into the aforementioned framework, and I'm confident they can be.

I did not think about the join framework. Having had a look at it, I think we can still stick with the current framework. The reason is that FilterInputFormat filters based on keys, and the join framework also performs its joins based on keys, so when we perform the joins we would want to use the same filter for all the datasets. I have not tried it personally, but I expect the following might work:

{code}
// set the join expression as usual
job.set("mapred.join.expr",
    "inner(tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,"
        + "\"hdfs://host:8020/foo/bar\"),"
        + "tbl(org.apache.hadoop.mapred.SequenceFileInputFormat.class,"
        + "\"hdfs://host:8020/foo/baz\"))");

// wrap CompositeInputFormat with FilterInputFormat
job.setInputFormat(FilterInputFormat.class);
FilterInputFormat.setBaseInputFormat(job, CompositeInputFormat.class);
{code}

Do you think this will do the trick?

bq. On that note, the MD5PercentFilter guards access to the MessageDigest within the instance, but multiple instances could corrupt it. It would probably be better if it were not static.

I think the current implementation is OK, since we update and digest the MessageDigest only in the MD5Hashcode function, which is already synchronized.

bq. Also, since a filter may discard the vast majority of the input, is it necessary to update the reporter to avoid a timeout? A call to next may churn through data for some time, and I'm uncertain whether one can expect the base InputFormat to keep the task alive. 
I'd expect it to be fine for the majority of cases, but if you felt like being paranoid it's not insane.

I think we had better be pragmatic about this one. Let's not spend a nontrivial amount of effort on it; we can fix it if it is ever exploited in some way.

bq. Still, you could probably still restrict these to Text

I envision that, given this implementation, filters will mostly be used for Text anyway, but I do not think we should limit their use to it. The fact that string conversion is performed on the keys before the comparison/regex matching is clearly documented in the javadocs of the respective filters. People are expected to read the javadocs before using the classes. *smile*

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:
>
> interface WritableFilter {
>    boolean accept(Writable item);
> }
>
> class FilterInputFormat implements InputFormat {
>    ...
> }
>
> FilterInputFormat would look in the JobConf for:
> mapred.input.filter.source = the underlying input format
> mapred.input.filter.filters = a list of class names that implement WritableFilter
>
> The FilterInputFormat will work like the current SequenceFilter, but use an internal RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value) to the RecordReader interface, but that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
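[Editor's note] The thread-safety point debated above (a shared MessageDigest corrupted by interleaved update/digest calls, and why a synchronized hash method addresses it) can be illustrated with a small self-contained model. This is NOT the actual MD5PercentFilter from the patch; the class name, the per-instance digest, and the hash-folding details below are illustrative assumptions. It only demonstrates the general key-sampling idea under discussion: keep a key iff its MD5 hash, taken modulo a frequency, is zero.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for an MD5-based key filter (not the Hadoop class):
// a key is accepted iff (MD5(key) mod frequency) == 0, so a frequency of N
// samples roughly 1/N of the keys deterministically.
public class MD5KeyFilter {
    private final MessageDigest digest;  // per-instance, not static
    private final int frequency;

    public MD5KeyFilter(int frequency) {
        if (frequency <= 0) {
            throw new IllegalArgumentException("frequency must be positive");
        }
        this.frequency = frequency;
        try {
            this.digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);  // MD5 is guaranteed by the JDK
        }
    }

    // Synchronized for the reason given in the discussion: update() and
    // digest() on a shared MessageDigest must never interleave across threads,
    // or the accumulated state (and thus the hash) is corrupted.
    public synchronized boolean accept(String key) {
        digest.reset();
        byte[] md5 = digest.digest(key.getBytes());
        // fold the first four hash bytes into a non-negative int
        int hash = 0;
        for (int i = 0; i < 4; i++) {
            hash = (hash << 8) | (md5[i] & 0xff);
        }
        hash &= Integer.MAX_VALUE;
        return hash % frequency == 0;
    }
}
```

With frequency 1 every key is kept; larger frequencies thin the stream deterministically, so the same key always gets the same verdict, which is what makes the filter usable on both sides of a key-based join.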