hadoop-common-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Enis Soztutar (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat
Date Thu, 20 Mar 2008 18:37:24 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12580886#action_12580886
] 

Enis Soztutar commented on HADOOP-449:
--------------------------------------

bq. Wow, looks like quite a rewrite! A few points:
Indeed it is *smile*

bq. Would it be possible to separate the Stringify functionality into a separate JIRA? It
is independently useful functionality and would be easier to review if it were extracted from
an already formidable patch. I like this idea.
Done in HADOOP-3048. 
bq. Your previous version- where this was library, not framework code- is preferred. The FilterRecordReader
should be part of a FilterInputFormat, which is- as you illustrated with your earlier patch-
not an undue burden on a user. Weaving this into Task and MapTask doesn't seem to add any
additional functionality or syntactic elegance. You could add counters to the library code
without changing core functionality. Being part of Task also prevents it from being integrated
into other libraries (like the aforementioned join framework).
Implementing this functionality as a library was good, and as you mentioned, there are lots
of benefits for doing so. However after spending some time on this I realized implementing
this in the core framework would be even better, because : 
 * Theoretically *every* job can/should use the filtering functionality, since there is no
drawback but lots of benefits. So this necessitates that the InputFormats of *every* job should
be FilterInputFormat, shading the real InputFormat under FilterInputFormat#setBaseInputFormat().

 * There is a lot of legacy code which can benefit from this, but people will be reluctant
(or lazy) to convert their job's input format to filter. So maximum usability and minimum
code change should be aimed. 
 * Although the functionality is at the core, we only change a few lines(except FilterRR)
from the Task and MapTask classes, effectively encapsulating the functionality. We may extract
FilterRecordReader to its own class, so that it is completely separate. I should note that
join can readily use filtering. The filtering just filters before passing the record to the
mapper, so the joined keys would be filtered. 

At the risk of repeating myself, I believe that every InputFormat of every job should be FilterInputFormat,
which is why we should integrate this to the core. However I am open to any suggestions and
discussions. *smile*

bq. There are no progress updates as you churn through records in your filtering engine. You
might want to keep a reporter around to prevent timeouts.
I will check that
bq. It would be much better if one could employ a (Raw)Comparator instead of using compareTo
in RangeFilter and equals in ItemFilter. Using JobConf::getOutputKeyComparator would be a
good start, though one could imagine uses where one would want different comparators for different
filters.
I will check that
bq.  It is unnecessary to distinguish between Filters and FunctionFilters. A single eval method
accepting the key and stack is sufficient (and permits functions like AND and OR to short-circuit,
somewhat). Forcing the application of every filter should probably be avoided when possible
(see next bullet).
Yes you're right (indeed I first did that way). However it is very unlikely that a user may
implement a FunctionFilter, but it is quite likely that she can implement a Filter. Thus adding
a stack argument that no filter uses seems confusing and unnecessary. Consider the javadoc
for the stack argument in the Filter#accept() method being as "@param stack filters should
not use this". 

bq. FilterEngine uses a static method and int to track the insertion order of filters and
their associated properties, which is probably not the correct approach. Consider a case where
a client submits two jobs, both with filters. Given that you require the user to add their
expression in postfix notation, it doesn't seem unduly onerous to require them to track the
position of their filters and avoid keeping a count. That said, assigning sequential identifiers
to your filter parameters is probably not an ideal approach in itself. You might consider
writing a parser or creating a Stringify-able object hierarchy (as in Swing, sort of) for
your expressions. Something that looks more like an expression parse tree would probably effect
a more efficient data structure.
I also think that the best way to do the filtering was creating the object hierarchy, then
pass it to the configuration, but somehow I did the postfix expressions way. I should think
a better way I guess. I guess serializing the filters -> stringify -> store in conf
-> pass to Task -> load from conf -> destringify -> deserialize might work. 

bq. Why is the MessageDigest static in MD5PercentFilter? Unless it is expensive to create,
the synchronization will probably cost more than making this an instance variable.
I just took the implementation MessageDigest$MD5PercentFilter, which used static digest, and
I assumed that getInstance() returns the singleton instance of the object. However it seems
it is not the case. I will fix this. 

bq. Well done! I look forward to this.
Me too, thanks for the in depth review Chris.



> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412
so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal
RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value)
to the RecordReader interface, but that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message