hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat
Date Tue, 18 Mar 2008 22:38:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580149#action_12580149
] 

Chris Douglas commented on HADOOP-449:
--------------------------------------

Wow, looks like quite a rewrite! A few points:
* Would it be possible to separate the Stringify functionality into a separate JIRA? It is
independently useful functionality and would be easier to review if it were extracted from
an already formidable patch. I like this idea.
** DefaultStringifier could reuse more of the objects it creates. The serializer will hold
a reference to the ByteArray(Input/Output)Buffers you've defined, so re-opening and closing
them every time is unnecessary if you use a resetable stream like o.a.h.io.Data(Input/Output)Buffer.
** The static convenience methods (load, store, storeArray) that create new Stringifiers in
DefaultStringifier are probably unnecessary. Adding a stringified object to a config doesn't
need that much support. Further, it's not clear to me how one would register a Stringifier
that isn't a DefaultStringifier. Would it make sense to follow the pattern in the Serialization
framework from HADOOP-1986- or the WritableComparator static initialization- and register
custom Stringifiers for classes?
* Your previous version- where this was library, not framework code- is preferred. The FilterRecordReader
should be part of a FilterInputFormat, which is- as you illustrated with your earlier patch-
not an undue burden on a user. Weaving this into Task and MapTask doesn't seem to add any
additional functionality or syntactic elegance. You could add counters to the library code
without changing core functionality. Being part of Task also prevents it from being integrated
into other libraries (like the aforementioned join framework).
* There are no progress updates as you churn through records in your filtering engine. You
might want to keep a reporter around to prevent timeouts.
* It would be much better if one could employ a (Raw)Comparator instead of using compareTo
in RangeFilter and equals in ItemFilter. Using JobConf::getOutputKeyComparator would be a
good start, though one could imagine uses where one would want different comparators for different
filters.
* It is unnecessary to distinguish between Filters and FunctionFilters. A single eval method
accepting the key and stack is sufficient (and permits functions like AND and OR to short-circuit,
somewhat). Forcing the application of every filter should probably be avoided when possible
(see next bullet).
* FilterEngine uses a static method and int to track the insertion order of filters and their
associated properties, which is probably not the correct approach. Consider a case where a
client submits two jobs, both with filters. Given that you require the user to add their expression
in postfix notation, it doesn't seem unduly onerous to require them to track the position
of their filters and avoid keeping a count. That said, assigning sequential identifiers to
your filter parameters is probably not an ideal approach in itself. You might consider writing
a parser or creating a Stringify-able object hierarchy (as in Swing, sort of) for your expressions.
Something that looks more like an expression parse tree would probably effect a more efficient
data structure.
* Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive to create, the
synchronization will probably cost more than making this an instance variable.

Well done! I look forward to this.

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412
so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal
RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value)
to the RecordReader interface, but that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message