hadoop-common-dev mailing list archives

From "Chris Douglas (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-449) Generalize the SequenceFileInputFilter to apply to any InputFormat
Date Fri, 28 Mar 2008 00:04:26 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582845#action_12582845 ]

Chris Douglas commented on HADOOP-449:
--------------------------------------

bq. will change the postfix expressions and develop a more intuitive way...

+1 I like this syntax. Since you're passing serialized objects with your filters, you might
want to test larger expressions to make sure length limits in the Configuration aren't a problem.
Hitting them seems unlikely, and I don't know whether we even have limits in that area, but
it would be worth testing. On that note, for the FunctionFilters you've defined, it might be a
good idea to permit them to take an arbitrary number of arguments (two or more), as in:

{noformat}
Filter f1 = new RangeFilter(2, 5);
Filter f2 = new RangeFilter(10, 20);
Filter f3 = new RangeFilter(30, 40);
Filter orFilter = new ORFilter(f1, f2, f3);
{noformat}

With your new syntax, this would be easy to implement, would open up more opportunities for
optimization within your FilterEngine, and would be very convenient for users.
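
To illustrate the implementation side, here is a minimal sketch of a varargs ORFilter; the
Filter/ORFilter names and the single accept(key) method are assumptions on my part, not the
actual API in your patch:

{noformat}
import java.util.ArrayList;
import java.util.List;

// Assumed shape of the filter interface: one accept() call per key.
interface Filter {
  boolean accept(Object key);
}

class ORFilter implements Filter {
  private final List<Filter> children = new ArrayList<Filter>();

  ORFilter(Filter... filters) {
    if (filters.length < 2) {
      throw new IllegalArgumentException("ORFilter requires at least two children");
    }
    for (Filter f : filters) {
      // Flattening nested ORs, so OR(OR(a, b), c) becomes OR(a, b, c), is the kind
      // of optimization the FilterEngine could apply in one central place.
      if (f instanceof ORFilter) {
        children.addAll(((ORFilter) f).children);
      } else {
        children.add(f);
      }
    }
  }

  public boolean accept(Object key) {
    for (Filter f : children) {
      if (f.accept(key)) {
        return true;        // the first accepting child short-circuits the rest
      }
    }
    return false;
  }
}
{noformat}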

bq. Having one eval method is a cleaner interface to core developers who could understand
how the postfix expression is evaluated...

Isn't all of this hidden by the FilterEngine? I'm not sure I understand what you're asserting
in this paragraph... I thought we were discussing whether or not it made sense to collapse
Filters and FunctionFilters into a single Filter interface that manipulates the key/stack.
By construction, you know that your FunctionFilters have either Filters or FunctionFilters
as children. Once you reconstruct the tree, it's not clear to me why you'd even need a stack.
The key gets passed through your tree to the child Filters, which return results to the parent,
which may or may not pass the key to its other children depending on the return value. It
might make sense to have a FunctionFilter base type from which your operators descend, since
they share common functionality, but the additional interface seems unnecessary. Have I misunderstood
you, or am I responding to your new syntax instead of the original, postfix, stack-based implementation?
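
In case it helps, this is roughly the shape I have in mind; a sketch only, assuming everything
collapses to a single Filter interface with boolean accept(key) and the operators share a
FunctionFilter base type (names are illustrative, not your patch's API):

{noformat}
// Single user-facing interface; leaf filters and operators both implement it.
interface Filter {
  boolean accept(Object key);
}

// Common child handling lives in the base type; operators only define accept().
abstract class FunctionFilter implements Filter {
  protected final Filter[] children;

  protected FunctionFilter(Filter... children) {
    this.children = children;
  }
}

class ANDFilter extends FunctionFilter {
  ANDFilter(Filter... children) {
    super(children);
  }

  // The key descends the tree; a rejecting child short-circuits its siblings,
  // so no explicit stack or postfix evaluation is needed.
  public boolean accept(Object key) {
    for (Filter f : children) {
      if (!f.accept(key)) {
        return false;
      }
    }
    return true;
  }
}
{noformat}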

bq. The postfix additions is irrelevant to whether filtering should be a library or not. The
postfix expressions are a way to specify the filtering expression to use, that part of the
API will not be changed if we had sticked with FilterInputFormat.

Sorry, I was unclear. You're right, the postfix syntax is orthogonal to this discussion since
that functionality wasn't present in the original patch. I was only pointing out that those
who could benefit from Filters aren't going to be turned away because they need to use a different
InputFormat; that is, using the library poses a more familiar and less difficult problem to users
than the syntax and implications of Filters.

bq. [library vs core in general]

It cannot be disputed that your integration of Filters into Tasks has a negligible cost and
that it does not prohibit their use elsewhere or in other frameworks. That said, the semantics
of Filters match those of InputFormat precisely. At its point of integration, filtering does exactly
what an InputFormat would effect (with one caveat concerning map counters). Doing the filtering in
an InputFormat also avoids any confusion about where it occurs, particularly where other decorator
InputFormats are applied. Though I'm sympathetic to making filtering part of every job, setting the
InputFormat seems like a modest burden that also happens to fit the existing semantics in an intuitive
and efficient way.
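
To make the comparison concrete, here is roughly what the library route costs a user, using the
old mapred API and the configuration keys proposed in the issue description below; the job and
filter class names are placeholders:

{noformat}
// Placeholder job/filter classes; FilterInputFormat is the class proposed in this issue.
JobConf conf = new JobConf(MyJob.class);
conf.setInputFormat(FilterInputFormat.class);
conf.set("mapred.input.filter.source", SequenceFileInputFormat.class.getName());
conf.set("mapred.input.filter.filters", MyRangeFilter.class.getName());
JobClient.runJob(conf);
{noformat}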

> Generalize the SequenceFileInputFilter to apply to any InputFormat
> ------------------------------------------------------------------
>
>                 Key: HADOOP-449
>                 URL: https://issues.apache.org/jira/browse/HADOOP-449
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>    Affects Versions: 0.17.0
>            Reporter: Owen O'Malley
>            Assignee: Enis Soztutar
>             Fix For: 0.17.0
>
>         Attachments: filtering_v2.patch, filtering_v3.patch, filterinputformat_v1.patch
>
>
> I'd like to generalize the SequenceFileInputFormat that was introduced in HADOOP-412 so that it can be applied to any InputFormat. To do this, I propose:
> interface WritableFilter {
>    boolean accept(Writable item);
> }
> class FilterInputFormat implements InputFormat {
>   ...
> }
> FilterInputFormat would look in the JobConf for:
>    mapred.input.filter.source = the underlying input format
>    mapred.input.filter.filters = a list of class names that implement WritableFilter
> The FilterInputFormat will work like the current SequenceFilter, but use an internal RecordReader rather than the SequenceFile. This will require adding a next(key) and getCurrentValue(value) to the RecordReader interface, but that will be addressed in a different issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

