lucene-dev mailing list archives

From "Steven Rowe (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2470) Add conditional branching/merging to Lucene's analysis pipeline
Date Wed, 19 May 2010 17:36:53 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869217#action_12869217 ]

Steven Rowe commented on LUCENE-2470:
-------------------------------------

One more thing from #lucene: if a conditionally applied filter does not receive one or more
of the input stream's tokens, it could either be reset(), or it could detect the resulting
position increment gaps.  Maybe both behaviors should be selectable via configuration?
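
A rough sketch of the gap-detection option, in plain Java with no Lucene classes. Everything here is hypothetical: Tok, its position field, and bigrams() are stand-ins made up for illustration, not an existing API. The idea is that a stateful filter which only sees the tokens routed to it compares each token's absolute position with the previous one, and drops its buffered state when the positions are not adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GapAwareBigrams {
    /** Hypothetical token: a term plus its absolute position in the original stream. */
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    /**
     * Builds bigrams only from tokens that were adjacent in the original stream.
     * A jump in position means some tokens bypassed this filter, so the
     * previously buffered token is discarded rather than paired across the gap.
     */
    static List<String> bigrams(List<Tok> routed) {
        List<String> out = new ArrayList<>();
        Tok prev = null;
        for (Tok cur : routed) {
            if (prev != null && cur.position == prev.position + 1) {
                out.add(prev.term + cur.term);
            }
            prev = cur; // on a gap this simply replaces the stale buffered token
        }
        return out;
    }

    public static void main(String[] args) {
        // The token at position 2 did not meet the condition and never reached the filter.
        List<Tok> routed = Arrays.asList(
                new Tok("A", 0), new Tok("B", 1), new Tok("C", 3), new Tok("D", 4));
        System.out.println(bigrams(routed)); // prints [AB, CD]
    }
}
```

The reset() option would instead leave the filter unaware of positions and make the surrounding branching code responsible for clearing its state whenever it skips a token.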

> Add conditional branching/merging to Lucene's analysis pipeline
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2470
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2470
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>    Affects Versions: 4.0
>            Reporter: Steven Rowe
>            Priority: Minor
>
> Captured from a #lucene brainstorming session with Robert Muir:
> Lucene's analysis pipeline would be more flexible if it were possible to apply filter(s)
> to only part of an input stream's tokens, under user-specifiable conditions (e.g. when a given
> token attribute has a particular value) in a way that did not place that responsibility on
> individual filters.
> Two use cases:
> 1. StandardAnalyzer could directly handle ideographic characters in the same way as CJKTokenizer,
>    which generates bigrams, if it could call ShingleFilter only when the TypeAttribute=<CJK>,
>    or when Robert's new ScriptAttribute=<Ideographic>.
> 2. Stemming might make sense for some stemmer/domain combinations only when token length
>    exceeds some threshold.  For example, a user could configure an analyzer to stem only when
>    CharTermAttribute length is greater than 4 characters.
> One potential way to achieve this conditional branching facility is with a new kind of
> filter that can be configured with one or more following filters and the condition(s) under
> which those filters should be engaged.  This could be called BranchingFilter.
> I think a MergingFilter, the inverse of BranchingFilter, is necessary in the current
> pipeline architecture, to have a single pipeline endpoint.  A MergingFilter might be useful
> in its own right, e.g. to collect document data from multiple sources.  Perhaps a conditional
> merging facility would be useful as well.
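
To make the proposal concrete, here is a minimal sketch of the branching idea in plain Java, without any Lucene classes. The names branch() and toyStem() are invented for illustration, and the "stemmer" is a toy that only strips a trailing "s"; none of this is an existing or proposed Lucene API. It shows the second use case: the wrapped filter runs only on tokens longer than 4 characters, and all tokens end up in one output list, which plays the role of the merge step.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

public class BranchingSketch {
    /**
     * Applies `filter` only to tokens matching `condition`; all other tokens
     * pass through unchanged. Both paths feed the same output list, so the
     * stream has a single endpoint.
     */
    static List<String> branch(List<String> tokens,
                               Predicate<String> condition,
                               UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(condition.test(token) ? filter.apply(token) : token);
        }
        return out;
    }

    /** Toy stand-in for a stemmer: strips one trailing "s". Not a real stemming algorithm. */
    static String toyStem(String token) {
        return token.endsWith("s") ? token.substring(0, token.length() - 1) : token;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("bus", "gas", "buses", "filters");
        // Stem only when the token is longer than 4 characters.
        System.out.println(branch(tokens, t -> t.length() > 4, BranchingSketch::toyStem));
        // prints [bus, gas, buse, filter]
    }
}
```

Note that the short tokens "bus" and "gas" are left intact, which is the kind of over-stemming the length threshold is meant to avoid. A per-token function like this sidesteps the harder question for stateful filters such as ShingleFilter, which need to know when tokens have bypassed them.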

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

