hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat
Date Mon, 11 Feb 2008 19:00:10 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567750#action_12567750 ]

Chris Douglas commented on HADOOP-449:

bq. I did not think about the join framework. Having a look at it, I guess we can still stick
with the current framework.

I think your example would work, but I was considering filters at arbitrary positions in the
join. I was thinking of adding a new node to the parser that accepts a Filter and an argument
(the range, the regexp, etc) and sets the filter expression prior to the instantiation of
the RecordReader (as it does for mapred.input.dir). Both should work.

bq. I think current implementation is OK, since we are updating and digesting the MessageDigest
in only the MD5Hashcode function which is already synchronized.

The MD5Hashcode function is synchronized on the instance, but it's protecting a static field. Unless
there's only one instance of the MD5PercentFilter, synchronizing on the method is insufficient.
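
To illustrate the locking concern, here is a minimal sketch, not the code from the patch. The class name Md5FilterSketch, the String key, and the way the digest is folded into a long are all assumptions made for the example. It shows one possible fix: taking the class lock, so that every instance contends for the same monitor as the shared static digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the filter discussed above; not the code from the patch.
class Md5FilterSketch {
    // Shared by every instance, so an instance-level lock cannot protect it.
    private static final MessageDigest DIGESTER = newMd5();

    private static MessageDigest newMd5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is unavailable", e);
        }
    }

    // A `synchronized` instance method would lock only `this`, so two filter
    // instances could interleave their calls on DIGESTER. Locking on the class
    // gives all instances the same monitor.
    long md5Hashcode(String key) {
        byte[] digest;
        synchronized (Md5FilterSketch.class) {
            DIGESTER.reset();
            digest = DIGESTER.digest(key.getBytes(StandardCharsets.UTF_8));
        }
        // Fold the first 8 digest bytes into a long, most significant byte first.
        long hash = 0;
        for (int i = 0; i < 8; i++) {
            hash = (hash << 8) | (digest[i] & 0xffL);
        }
        return hash;
    }
}
```

Other options would be a per-instance MessageDigest or a ThreadLocal one; either avoids the shared lock at the cost of extra digest objects.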

bq. I think we better be pragmatic about this one. Let's not spend some nontrivial amount of
effort on this. We can fix it if it is exploited in some way.

*nod* Again, I think it'll be fine for the majority of cases, but I thought I'd mention it.

bq. People are expected to read the javadocs before using the classes.

Well, fair enough. Really, it only supports Text, and this seems like a convenient way to
annotate the class since it's not difficult to effect the translation. Further, toString isn't
usually considered in the Comparable/equals/hashCode family of equality, so it seems risky.

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>         Attachments: filterinputformat_v1.patch
> I'd like to generalize the SequenceFileInputFilter that was introduced in HADOOP-412
> so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal
> RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value)
> to the RecordReader interface, but that will be addressed in a different issue.
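
The proposal quoted above can be sketched roughly as follows. This is an illustration under stated assumptions, not the attached patch: ItemFilter stands in for the proposed WritableFilter (with Object in place of Hadoop's Writable), FilteringReader is a hypothetical wrapper around a plain Iterator rather than a real RecordReader, and requiring every filter to accept a record is an assumption, since the issue does not say how multiple filters combine.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Stand-in for the proposed WritableFilter; Object replaces Hadoop's Writable.
interface ItemFilter {
    boolean accept(Object item);
}

// Hypothetical wrapper: yields only the records that every filter accepts.
class FilteringReader implements Iterator<Object> {
    private final Iterator<?> source;       // plays the role of the internal reader
    private final List<ItemFilter> filters; // would be loaded from the job configuration
    private Object pending;
    private boolean hasPending;

    FilteringReader(Iterator<?> source, List<ItemFilter> filters) {
        this.source = source;
        this.filters = filters;
    }

    @Override
    public boolean hasNext() {
        // Advance the underlying source until a record passes all filters.
        while (!hasPending && source.hasNext()) {
            Object candidate = source.next();
            if (acceptedByAll(candidate)) {
                pending = candidate;
                hasPending = true;
            }
        }
        return hasPending;
    }

    @Override
    public Object next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        Object result = pending;
        pending = null;
        hasPending = false;
        return result;
    }

    // Assumption: filters combine with logical AND.
    private boolean acceptedByAll(Object item) {
        for (ItemFilter filter : filters) {
            if (!filter.accept(item)) {
                return false;
            }
        }
        return true;
    }
}
```

In the real framework the filter class names would come from mapred.input.filter.filters and the wrapped reader from the format named in mapred.input.filter.source; both lookups are omitted here.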

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
