lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <>
Subject [jira] [Updated] (LUCENE-8240) Make TokenStreamComponents.setReader public
Date Thu, 05 Apr 2018 14:27:00 GMT


Mike Sokolov updated LUCENE-8240:
    Summary: Make TokenStreamComponents.setReader public  (was: Support different analysis
per field instance)

> Make TokenStreamComponents.setReader public
> -------------------------------------------
>                 Key: LUCENE-8240
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments:
> The simplest change to support different analysis per field instance would be to make
TokenStreamComponents.setReader() public. An alternative would be to provide a SubFieldAnalyzer
along the lines of the attached one, although for the reasons given below I think that implementation
is a little hacky and would ideally be supported in a different way before making *that* part of
a public Lucene API.
> Exposing this method would allow a third-party extension to call it in order to wrap
TokenStreamComponents. My use case is a SubFieldAnalyzer (attached, for reference) that applies
different analysis to different instances of a field. This supports a big "catch-all" field
whose instances each receive different (index-time) text processing. We implement that by creating
a TokenStreamComponents that wraps separate per-subfield components and switches among them
when setReader() is called.
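
The wrap-and-switch idea described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the attached SubFieldAnalyzer: StubComponents, TaggedReader and SwitchingComponents are hypothetical stand-ins (StubComponents only mimics the setReader() hook of TokenStreamComponents), and falling back to a default chain for untagged or unknown readers is an assumption.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for TokenStreamComponents: only the setReader() hook is modeled.
class StubComponents {
    final String label;   // which analysis chain this instance represents
    Reader lastReader;    // the most recent Reader handed to this chain

    StubComponents(String label) {
        this.label = label;
    }

    void setReader(Reader reader) {
        this.lastReader = reader;
    }
}

// Hypothetical Reader that identifies the subfield its text belongs to.
class TaggedReader extends StringReader {
    final String subField;

    TaggedReader(String subField, String text) {
        super(text);
        this.subField = subField;
    }
}

// Wraps per-subfield components and switches among them when setReader() is called.
class SwitchingComponents extends StubComponents {
    private final Map<String, StubComponents> bySubField = new HashMap<>();
    private final StubComponents fallback;  // assumed default for untagged or unknown readers
    private StubComponents active;

    SwitchingComponents(StubComponents fallback) {
        super("switching");
        this.fallback = fallback;
        this.active = fallback;
    }

    void register(String subField, StubComponents components) {
        bySubField.put(subField, components);
    }

    @Override
    void setReader(Reader reader) {
        StubComponents chosen = fallback;
        if (reader instanceof TaggedReader) {
            String subField = ((TaggedReader) reader).subField;
            chosen = bySubField.getOrDefault(subField, fallback);
        }
        // Forwarding to a *separate* instance is the call that needs setReader() to be accessible.
        chosen.setReader(reader);
        active = chosen;
    }

    StubComponents active() {
        return active;
    }
}
```

In this sketch, registering a "title" chain and then calling setReader(new TaggedReader("title", "...")) makes that chain the active one, while a plain StringReader selects the fallback.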
> Why setReader()? This is the only part of the API where we can inject this notion of
subfields. setReader() is called with a Reader for each field instance, and we supply a special
Reader that identifies its subfield.
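
A Reader that identifies its subfield could look roughly like the sketch below. The names SubFieldReader and SubFieldReaders are hypothetical and are not taken from the attachment; the sketch only assumes that the text should pass through unchanged and that the consumer recovers the tag with an instanceof check.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical Reader wrapper: passes text through unchanged and carries a subfield name.
class SubFieldReader extends FilterReader {
    private final String subField;

    SubFieldReader(String subField, Reader in) {
        super(in);
        this.subField = subField;
    }

    String subField() {
        return subField;
    }
}

// Hypothetical helpers for the consuming side.
final class SubFieldReaders {
    private SubFieldReaders() {}

    // Returns the subfield a Reader was tagged with, or null for an ordinary Reader.
    static String subFieldOf(Reader reader) {
        return (reader instanceof SubFieldReader) ? ((SubFieldReader) reader).subField() : null;
    }

    // Drains a Reader into a String, to show that tagging does not alter the text.
    static String readAll(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        try {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

One Reader of this kind would be created per field instance. Any code that wraps the Reader before setReader() sees it would hide the tag from the instanceof check, so the sketch assumes no such wrapping takes place.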
> This is a bit hacky – ideally subfields would be first-class citizens in the Analyzer
API, so that, e.g., there would be methods like Analyzer.createComponents(String fieldName, String
subFieldName). However, this seems like a pretty big change for an experimental feature,
so living with the Reader-per-subfield hack seems like an OK tradeoff for now.
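
For comparison, a first-class subfield API might have a shape like the following. This is purely illustrative: the two-argument createComponents is the hypothetical method mentioned above, not an existing Lucene API, and SubFieldAwareAnalyzer, CatchAllAnalyzer and the use of a plain text-to-tokens function in place of TokenStreamComponents are all assumptions made to keep the sketch self-contained.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical subfield-aware analyzer; a text-to-tokens function stands in for TokenStreamComponents.
abstract class SubFieldAwareAnalyzer {
    abstract Function<String, List<String>> createComponents(String fieldName, String subFieldName);

    List<String> analyze(String fieldName, String subFieldName, String text) {
        return createComponents(fieldName, subFieldName).apply(text);
    }
}

// Example: one catch-all field whose subfields are analyzed differently.
class CatchAllAnalyzer extends SubFieldAwareAnalyzer {
    @Override
    Function<String, List<String>> createComponents(String fieldName, String subFieldName) {
        if ("id".equals(subFieldName)) {
            // Identifiers are kept as a single, unmodified token.
            return text -> Collections.singletonList(text);
        }
        // Everything else is lowercased and split on whitespace.
        return text -> Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }
}
```

With an API of this shape the subfield name would arrive as an ordinary argument, so no special Reader would be needed to carry it.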
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis package in order
to call TokenStreamComponents.setReader (on a separate instance) and satisfy Java's access
rules, which is awkward.
