lucene-dev mailing list archives

From "Mike Sokolov (JIRA)" <>
Subject [jira] [Commented] (LUCENE-8240) Make TokenStreamComponents.setReader public
Date Fri, 06 Apr 2018 19:53:00 GMT


Mike Sokolov commented on LUCENE-8240:

Well, I don't have much more to say, but perhaps this background from our use case will sway
you :) We did try breaking up our large catch-all field into separate fields, since that is more
natural for Lucene than having these subfields. However, we have so many of them (hundreds) that
the performance of our queries was poor due to the zillions of term queries we had to generate.
In the end, smooshing all these little fields together into one big one, with this switchable
analyzer, ended up being the best tradeoff.

> Make TokenStreamComponents.setReader public
> -------------------------------------------
>                 Key: LUCENE-8240
>                 URL:
>             Project: Lucene - Core
>          Issue Type: Wish
>          Components: modules/analysis
>            Reporter: Mike Sokolov
>            Priority: Major
>         Attachments:
> The simplest change for this would be to make TokenStreamComponents.setReader() public.
Another alternative would be to provide a SubFieldAnalyzer along the lines of what is attached,
although for reasons given below I think this implementation is a little hacky and would ideally
be supported in a different way before making *that* part of a public Lucene API.
> Exposing this method would allow a third-party extension to access it in order to wrap
TokenStreamComponents. My use case is a SubFieldAnalyzer (attached, for reference) that applies
different analysis to different instances of a field. This supports a big "catch-all" field
that has different (index-time) text processing. The way we implement that is by creating
a TokenStreamComponents that wraps separate per-subfield components and switches among them
when setReader() is called.
> Why setReader()? This is the only part of the API where we can inject this notion of
subfields. setReader() is called with a Reader for each field instance, and we supply a special
Reader that identifies its subfield.
> This is a bit hacky – ideally subfields would be first-class citizens in the Analyzer
API, so that, e.g., there would be methods like Analyzer.createComponents(String fieldName, String
subFieldName), etc. However, this seems like a pretty big change for an experimental feature,
so it seems like an OK tradeoff to live with the Reader-per-subfield hack for now.
> Currently SubFieldAnalyzer has to live in the org.apache.lucene.analysis package in order
to call TokenStreamComponents.setReader (on a separate instance) and satisfy Java's access
rules, which is awkward.
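
The switching idea described in the issue can be sketched roughly as follows. This is not
the attached SubFieldAnalyzer and it does not use Lucene at all: Components, SimpleComponents,
SwitchingComponents, SubFieldReader and SubFieldDemo are hypothetical stand-ins made up for
illustration, and a plain string function takes the place of a real analysis chain. It only
shows a wrapper that inspects the Reader passed to setReader() and delegates to the matching
per-subfield components.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical Reader that tags its content with a subfield name. */
class SubFieldReader extends FilterReader {
    final String subField;

    SubFieldReader(String subField, Reader in) {
        super(in);
        this.subField = subField;
    }
}

/** Hypothetical stand-in for the role TokenStreamComponents plays (not the Lucene class). */
interface Components {
    void setReader(Reader reader);

    String analyze();
}

/** Reads the whole input and applies one fixed "analysis" function to it. */
class SimpleComponents implements Components {
    private final Function<String, String> analysis;
    private Reader reader;

    SimpleComponents(Function<String, String> analysis) {
        this.analysis = analysis;
    }

    @Override
    public void setReader(Reader reader) {
        this.reader = reader;
    }

    @Override
    public String analyze() {
        StringBuilder text = new StringBuilder();
        try {
            int c;
            while ((c = reader.read()) != -1) {
                text.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return analysis.apply(text.toString());
    }
}

/** Wraps per-subfield components and switches among them when setReader() is called. */
class SwitchingComponents implements Components {
    private final Map<String, Components> bySubField;
    private final Components fallback;
    private Components current;

    SwitchingComponents(Map<String, Components> bySubField, Components fallback) {
        this.bySubField = bySubField;
        this.fallback = fallback;
        this.current = fallback;
    }

    @Override
    public void setReader(Reader reader) {
        // The Reader itself carries the subfield identity; untagged readers use the fallback.
        if (reader instanceof SubFieldReader) {
            String name = ((SubFieldReader) reader).subField;
            current = bySubField.getOrDefault(name, fallback);
        } else {
            current = fallback;
        }
        current.setReader(reader);
    }

    @Override
    public String analyze() {
        return current.analyze();
    }
}

class SubFieldDemo {
    public static void main(String[] args) {
        Map<String, Components> perSubField = new HashMap<>();
        perSubField.put("title", new SimpleComponents(s -> s.toLowerCase()));
        Components components =
            new SwitchingComponents(perSubField, new SimpleComponents(s -> s));

        components.setReader(new SubFieldReader("title", new StringReader("Hello World")));
        System.out.println(components.analyze()); // prints "hello world"

        components.setReader(new StringReader("Hello World"));
        System.out.println(components.analyze()); // prints "Hello World"
    }
}
```

In this sketch the wrapper can call setReader() on the wrapped components because the
stand-in interface is visible to it. The request in the issue is to make the equivalent
call possible from outside the org.apache.lucene.analysis package.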

This message was sent by Atlassian JIRA
