lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Created: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline
Date Wed, 19 May 2010 17:26:53 GMT
Add conditional branching/merging to Lucene's analysis pipeline
---------------------------------------------------------------

                 Key: LUCENE-2470
                 URL: https://issues.apache.org/jira/browse/LUCENE-2470
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Analysis
    Affects Versions: 4.0
            Reporter: Steven Rowe
            Priority: Minor


Captured from a #lucene brainstorming session with Robert Muir:

Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s) to
only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given
token attribute has a particular value) in a way that did not place that responsibility on
individual filters.

Two use cases:

# StandardAnalyzer could directly handle ideographic characters in the same way as CJKTokenizer,
which generates bigrams, if it could call ShingleFilter only when TypeAttribute=<CJK>,
or when Robert's new ScriptAttribute=<Ideographic>.
# Stemming might make sense for some stemmer/domain combinations only when token length exceeds
some threshold.  For example, a user could configure an analyzer to stem only when CharTermAttribute
length is greater than 4 characters.

One potential way to achieve this conditional branching facility is with a new kind of filter
that can be configured with one or more following filters and the condition(s) under which
those filters should be engaged.  This could be called BranchingFilter.
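
As a rough illustration of the BranchingFilter idea, here is a minimal sketch in plain Java
with no Lucene dependency.  SketchToken, SketchBranchingFilter and toyStem are hypothetical
names invented for this example; they are not Lucene classes, and the design is only one
possible reading of the proposal.  The condition is the length threshold from the second
use case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical stand-in for a token and its attributes; not a Lucene class. */
final class SketchToken {
    final String term; // plays the role of CharTermAttribute
    final String type; // plays the role of TypeAttribute

    SketchToken(String term, String type) {
        this.term = term;
        this.type = type;
    }
}

/** Hypothetical BranchingFilter: engages the wrapped filter only when the condition holds. */
final class SketchBranchingFilter {
    private final Predicate<SketchToken> condition;
    private final UnaryOperator<SketchToken> branch;

    SketchBranchingFilter(Predicate<SketchToken> condition, UnaryOperator<SketchToken> branch) {
        this.condition = condition;
        this.branch = branch;
    }

    /** Tokens that fail the condition pass through untouched, so there is one output stream. */
    List<SketchToken> apply(List<SketchToken> input) {
        List<SketchToken> output = new ArrayList<>();
        for (SketchToken t : input) {
            output.add(condition.test(t) ? branch.apply(t) : t);
        }
        return output;
    }
}

final class BranchingDemo {
    /** Toy stemmer that strips one trailing "s"; a placeholder for a real stemmer. */
    static SketchToken toyStem(SketchToken t) {
        String term = t.term.endsWith("s") ? t.term.substring(0, t.term.length() - 1) : t.term;
        return new SketchToken(term, t.type);
    }

    /** Collects the term text of each token, for printing and comparison. */
    static List<String> terms(List<SketchToken> tokens) {
        List<String> out = new ArrayList<>();
        for (SketchToken t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stem only when the term is longer than 4 characters.
        SketchBranchingFilter filter =
            new SketchBranchingFilter(t -> t.term.length() > 4, BranchingDemo::toyStem);
        List<SketchToken> input = Arrays.asList(
            new SketchToken("bus", "<ALPHANUM>"),
            new SketchToken("gas", "<ALPHANUM>"),
            new SketchToken("engines", "<ALPHANUM>"),
            new SketchToken("filters", "<ALPHANUM>"));
        System.out.println(terms(filter.apply(input))); // prints [bus, gas, engine, filter]
    }
}
```

This sketch maps tokens one-to-one, which is enough for the stemming case.  The ShingleFilter
case would need the branch to see runs of adjacent tokens, so a real implementation would
have to wrap a stream rather than a per-token function.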

I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current pipeline
architecture, to have a single pipeline endpoint.  A MergingFilter might be useful in its
own right, e.g. to collect document data from multiple sources.  Perhaps a conditional merging
facility would be useful as well.
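
For the merging side, a minimal sketch might look like the following.  It assumes the
simplest possible policy, draining each source in turn into a single endpoint; the issue
does not say how tokens from different sources should be ordered, so that choice, and the
name MergingSketch, are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class MergingSketch {
    /** Hypothetical merge: exposes several sources as one stream, in source order. */
    static <T> Iterator<T> merge(List<Iterator<T>> sources) {
        return new Iterator<T>() {
            private int current = 0; // index of the source being drained

            @Override
            public boolean hasNext() {
                // Skip over sources that are already exhausted.
                while (current < sources.size() && !sources.get(current).hasNext()) {
                    current++;
                }
                return current < sources.size();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return sources.get(current).next();
            }
        };
    }

    /** Drains an iterator into a list, for printing and comparison. */
    static <T> List<T> drain(Iterator<T> it) {
        List<T> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Iterator<String>> sources = new ArrayList<>();
        sources.add(Arrays.asList("title", "tokens").iterator());
        sources.add(new ArrayList<String>().iterator()); // an empty source is skipped
        sources.add(Arrays.asList("body").iterator());
        System.out.println(drain(merge(sources))); // prints [title, tokens, body]
    }
}
```

A conditional variant could take a predicate per source and drop tokens that fail it before
they reach the shared endpoint.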

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

