hadoop-common-dev mailing list archives

From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat
Date Mon, 17 Mar 2008 17:44:27 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Enis Soztutar updated HADOOP-449:

    Attachment: filtering_v2.patch

After spending some time thinking about this patch, I have redesigned the API. The changes
are:

* Refactored WritableFilter to Filter, so that a Filter can be applied to non-Writables (via
the serialization framework)
* Added a Stringifier interface and a default implementation that uses the Hadoop serialization
framework. Ordinary objects can now be kept in the configuration. Given the performance loss
of String.equals() comparisons, the alternatives were to pass the actual objects in the
configuration or not to use filtering at all.
* Added FilterEngine to evaluate postfix filter expressions
* Added OR, AND, NOT Filters
* Fixed synchronization issue in MessageDigest
* Moved filtering into the core framework instead of a library.
* Changed the API so that JobConf is now used to add filters. This API is better since it
hides nearly all the details from the application code. An application just configures the
filter by calling JobConf#addFilter().
* Added a counter for filtered-out records
* Added filtering section to the mapred tutorial. 
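
To illustrate how AND, OR, and NOT filters might be combined through a postfix expression, here is a minimal sketch. It is not taken from filtering_v2.patch: the class PostfixFilterSketch, its nested Filter interface, the compile() method, and the string operator tokens are all hypothetical names chosen for this example, and the patch's actual Filter and FilterEngine signatures may differ.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: none of these names come from filtering_v2.patch.
final class PostfixFilterSketch {

    /** Minimal filter contract over arbitrary objects, not only Writables. */
    interface Filter<T> {
        boolean accept(T item);
    }

    static <T> Filter<T> and(Filter<T> a, Filter<T> b) {
        return item -> a.accept(item) && b.accept(item);
    }

    static <T> Filter<T> or(Filter<T> a, Filter<T> b) {
        return item -> a.accept(item) || b.accept(item);
    }

    static <T> Filter<T> not(Filter<T> a) {
        return item -> !a.accept(item);
    }

    /**
     * Builds one composite filter from a postfix token list. Each token is
     * either a Filter (an operand) or one of the strings "AND", "OR", "NOT".
     */
    static <T> Filter<T> compile(List<?> postfix) {
        Deque<Filter<T>> stack = new ArrayDeque<>();
        for (Object token : postfix) {
            if (token instanceof Filter) {
                @SuppressWarnings("unchecked")
                Filter<T> leaf = (Filter<T>) token;
                stack.push(leaf);
            } else if ("NOT".equals(token)) {
                stack.push(not(pop(stack)));
            } else if ("AND".equals(token) || "OR".equals(token)) {
                // The right operand was pushed last, so it is popped first.
                Filter<T> right = pop(stack);
                Filter<T> left = pop(stack);
                stack.push("AND".equals(token) ? and(left, right) : or(left, right));
            } else {
                throw new IllegalArgumentException("Unknown token: " + token);
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("Malformed postfix expression");
        }
        return stack.pop();
    }

    private static <T> Filter<T> pop(Deque<Filter<T>> stack) {
        if (stack.isEmpty()) {
            throw new IllegalArgumentException("Operator is missing an operand");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        Filter<String> startsWithA = s -> s.startsWith("a");
        Filter<String> longerThan3 = s -> s.length() > 3;
        // Postfix form of: NOT (startsWithA AND longerThan3)
        Filter<String> f = compile(Arrays.asList(startsWithA, longerThan3, "AND", "NOT"));
        System.out.println(f.accept("apple")); // false
        System.out.println(f.accept("ab"));    // true
    }
}
```

Postfix form needs no parentheses or precedence rules, so a single stack is enough to evaluate it, which is presumably part of its appeal for expressions stored in a job configuration.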

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>         Attachments: filtering_v2.patch, filterinputformat_v1.patch
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412
> so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal
> RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value)
> to the RecordReader interface, but that will be addressed in a different issue.
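
The wrapping idea in the proposal above, a reader that delegates to an underlying reader and skips rejected records, can be sketched in plain Java without any Hadoop types. Everything here is hypothetical: FilteringReaderSketch, ItemFilter, and FilteringIterator are illustrative names, a java.util.Iterator stands in for the underlying RecordReader, and the rule that a record must pass every configured filter is an assumption, since the proposal does not say how multiple filters combine. The counter mirrors the filtered-out record counter mentioned in the update.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: a plain Iterator stands in for Hadoop's RecordReader.
final class FilteringReaderSketch {

    /** Stand-in for the proposed WritableFilter, generalized to any type. */
    interface ItemFilter<T> {
        boolean accept(T item);
    }

    /** Yields only the items of the source that every filter accepts. */
    static final class FilteringIterator<T> implements Iterator<T> {
        private final Iterator<T> source;
        private final List<ItemFilter<T>> filters;
        private T lookahead;
        private boolean hasLookahead;
        private long filteredOut;

        FilteringIterator(Iterator<T> source, List<ItemFilter<T>> filters) {
            this.source = source;
            this.filters = filters;
        }

        @Override
        public boolean hasNext() {
            // Advance the source until an accepted item is buffered or the source ends.
            while (!hasLookahead && source.hasNext()) {
                T candidate = source.next();
                if (acceptedByAll(candidate)) {
                    lookahead = candidate;
                    hasLookahead = true;
                } else {
                    filteredOut++;
                }
            }
            return hasLookahead;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            T result = lookahead;
            lookahead = null;
            hasLookahead = false;
            return result;
        }

        /** Number of source items rejected so far. */
        long filteredOutCount() {
            return filteredOut;
        }

        // Assumption: an item must pass every filter in the list.
        private boolean acceptedByAll(T item) {
            for (ItemFilter<T> filter : filters) {
                if (!filter.accept(item)) {
                    return false;
                }
            }
            return true;
        }
    }
}
```

Buffering one accepted item in hasNext() keeps the wrapper lazy: the source is only advanced as far as needed to find the next record that survives filtering.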

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
