Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <1795210387.1206046524857.JavaMail.jira@brutus>
Date: Thu, 20 Mar 2008 13:55:24 -0700 (PDT)
From: "Chris Douglas (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-449) Generalize the
 SequenceFileInputFilter to apply to any InputFormat
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580923#action_12580923 ] 

Chris Douglas commented on HADOOP-449:
--------------------------------------

bq. Theoretically every job can/should use the filtering functionality, since there is no drawback but lots of benefits. So this necessitates that the InputFormats of every job should be FilterInputFormat, shading the real InputFormat under FilterInputFormat#setBaseInputFormat().

Even if every job can use the filtering functionality, integrating it into Task/MapTask limits where it may be applied. If, for example, one were reading from multiple sources, a different set of filters could be applied to each source. Similarly, a map or a reduce task could use a filtering record reader to read a subset of records indirectly. If filtering is limited to the interfaces you provide to MapTask, then this code can't be reused elsewhere. Again, weaving it into core doesn't seem to give you extra functionality (if anything, it seems to make it less general), and the library approach carries no performance hit, so making it a library looks laced with win.
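
To make the library argument concrete, here is a minimal sketch of a filtering reader that wraps any other reader. It deliberately avoids the real mapred interfaces: SimpleReader, ListReader, FilteringReader and FilterLibraryDemo are hypothetical names invented for this illustration, not classes from the patch or from Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a record reader; NOT Hadoop's RecordReader interface.
interface SimpleReader<T> {
    /** Returns the next record, or null when the input is exhausted. */
    T next();
}

// Reads records from an in-memory list (stands in for one input source).
class ListReader<T> implements SimpleReader<T> {
    private final Iterator<T> it;
    ListReader(List<T> records) { this.it = records.iterator(); }
    public T next() { return it.hasNext() ? it.next() : null; }
}

// Wraps any reader and skips the records its filter rejects.
class FilteringReader<T> implements SimpleReader<T> {
    private final SimpleReader<T> base;
    private final Predicate<T> filter;
    FilteringReader(SimpleReader<T> base, Predicate<T> filter) {
        this.base = base;
        this.filter = filter;
    }
    public T next() {
        T rec;
        while ((rec = base.next()) != null) {
            if (filter.test(rec)) {
                return rec;
            }
        }
        return null; // base reader exhausted
    }
}

class FilterLibraryDemo {
    // Collects everything a reader produces into a list.
    static <T> List<T> drain(SimpleReader<T> reader) {
        List<T> out = new ArrayList<>();
        for (T rec = reader.next(); rec != null; rec = reader.next()) {
            out.add(rec);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> source = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Two sources, each wrapped with its own filter.
        SimpleReader<Integer> a =
            new FilteringReader<Integer>(new ListReader<Integer>(source), x -> x % 2 == 0);
        SimpleReader<Integer> b =
            new FilteringReader<Integer>(new ListReader<Integer>(source), x -> x > 4);
        System.out.println(drain(a)); // [2, 4, 6]
        System.out.println(drain(b)); // [5, 6]
    }
}
```

Because the wrapper only depends on the reader interface, the same class can be used per source, inside a map, or inside a reduce, which is the reuse that a Task/MapTask integration would rule out.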

bq. There is a lot of legacy code which can benefit from this, but people will be reluctant (or lazy) to convert their job's input format to filter. So maximum usability and minimum code change should be aimed.

I disagree, and I cite your previous patch. Its interface was not only easier to understand than the postfix additions, but specifying the baseInputFormat was very intuitive. For users seeking to benefit from this, the difficulty delta between the library and Task implementations is so slight that I doubt it'll actually prevent someone from taking advantage of it.

bq. Although the functionality is at the core, we only change a few lines(except FilterRR) from the Task and MapTask classes, effectively encapsulating the functionality. We may extract FilterRecordReader to its own class, so that it is completely separate. I should note that join can readily use filtering. The filtering just filters before passing the record to the mapper, so the joined keys would be filtered.

Not exactly. If I apply a RangeFilter to each of my record readers, the join considers a smaller subset of the records read. Since the join generates the cross of all the matching records (i.e. sources A, B, C sharing key k, each holding two values for it, would emit [a1, b1, c1], [a1, b1, c2], [a1, b2, c1], ... [a2, b2, c2]), a filter applied after the join would have to reject the cross of all those records, rather than each record individually. Further, if I only want to filter the records from B in the previous example, the filter in front of my map would need more state to ensure duplicate records aren't emitted to the map (or my map code would have to deal with that). One can imagine other cases where, again, filtering shouldn't be limited to a single part of the job, or cases where it might change the result if filters can only be applied at a certain stage.
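
A small sketch of the A, B, C example, under the assumption that each source holds two values for the shared key. JoinFilterDemo and cross() are hypothetical helpers written for this illustration; they are not the join framework's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class JoinFilterDemo {
    // Cross product of the values that three sources hold for one shared key.
    static List<List<String>> cross(List<String> a, List<String> b, List<String> c) {
        List<List<String>> out = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                for (String z : c) {
                    out.add(Arrays.asList(x, y, z));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> a = Arrays.asList("a1", "a2");
        List<String> b = Arrays.asList("b1", "b2");
        List<String> c = Arrays.asList("c1", "c2");

        // No filtering: the join emits every combination.
        System.out.println(cross(a, b, c).size()); // 8

        // Filter on B's reader, before the join: b2 never reaches the join.
        List<String> bFiltered = b.stream()
            .filter(v -> !v.equals("b2"))
            .collect(Collectors.toList());
        List<List<String>> pre = cross(a, bFiltered, c);
        System.out.println(pre.size()); // 4

        // Filter after the join: all 8 tuples are built, then each one is
        // inspected at B's position to reject those containing b2.
        List<List<String>> post = cross(a, b, c).stream()
            .filter(t -> !t.get(1).equals("b2"))
            .collect(Collectors.toList());
        System.out.println(post.equals(pre)); // true
    }
}
```

The two results match here only because the post-join filter knows which position in the tuple belongs to B; the per-reader filter needs no such knowledge and does less work.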

bq. However it is very unlikely that a user may implement a FunctionFilter, but it is quite likely that she can implement a Filter. Thus adding a stack argument that no filter uses seems confusing and unnecessary. Consider the javadoc for the stack argument in the Filter#accept() method being as "@param stack filters should not use this".

Though Filters will be useful, the semantics of a FunctionFilter aren't so mysterious that people won't want to write those, too. Again, the purpose of both parameters is easily explained, and people will decide whether they should employ them or not. It seems premature to decide that there are only two types of filters, anyway. It sounds like we agree that it's a cleaner interface with only one signature for the eval; I'm just not sure I see the extensibility benefit as clearly.
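
For illustration, one possible reading of a single accept signature shared by plain filters and function filters, evaluated in postfix order. This is an assumption about the general shape only: StackFilter, and(), not() and eval() are hypothetical names, and the patch may define and evaluate its filters differently.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical single-signature filter; not the interface from the patch.
interface StackFilter<T> {
    /** Plain filters look only at the record; function filters pop operands. */
    boolean accept(T record, Deque<Boolean> stack);
}

class PostfixFilterDemo {
    // Function filter: pops two earlier results and combines them.
    static <T> StackFilter<T> and() {
        return (rec, stack) -> {
            boolean right = stack.pop();
            boolean left = stack.pop();
            return left && right;
        };
    }

    // Function filter: pops one earlier result and negates it.
    static <T> StackFilter<T> not() {
        return (rec, stack) -> !stack.pop();
    }

    // Evaluates a postfix filter expression against one record.
    static <T> boolean eval(List<StackFilter<T>> postfix, T record) {
        Deque<Boolean> stack = new ArrayDeque<>();
        for (StackFilter<T> f : postfix) {
            // accept() runs first (and may pop), then its result is pushed.
            stack.push(f.accept(record, stack));
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // Plain filters simply ignore the stack argument.
        StackFilter<Integer> atLeast10 = (rec, stack) -> rec >= 10;
        StackFilter<Integer> even = (rec, stack) -> rec % 2 == 0;

        // Postfix form of: atLeast10 AND even
        List<StackFilter<Integer>> expr =
            Arrays.asList(atLeast10, even, PostfixFilterDemo.<Integer>and());

        System.out.println(eval(expr, 12)); // true
        System.out.println(eval(expr, 11)); // false
        System.out.println(eval(expr, 4));  // false
    }
}
```

With one signature the evaluator needs no special cases, and someone writing a plain filter can leave the stack untouched.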

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFilter that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value) to the RecordReader interface, but that will be addressed in a different issue.
