lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Fri, 15 Jan 2010 02:58:54 GMT


Hoss Man commented on SOLR-1677:

Yes. The whole point is to avoid Analyzer mismatches.

Say a stoplist was modified between Lucene versions. Sure, you can hack it
and ask for an old match version, so you get a stoplist other than the one that
was used to build the index... but why would you want to?

...but that's no different than using StopFilter(someStopWordSet) at indexing and StopFilter(someOtherStopWordSet)
at query time -- Solr happily lets you do that with its index/query analyzers ... you may
have a very good reason for doing that.  Likewise you may have an existing field using the
"default" stopwords list from Version.LUCENE_24 that you don't want to change because you
want clients that search on that field to continue to get the same behavior, but when you
add a new field you want it to have the current default stopwords because it's queried by
entirely different clients.
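
As a rough illustration of that scenario, here is a plain-Java sketch of two fields filtered with different "default" stoplists. It deliberately does not use the Lucene StopFilter API; the class name, the filter() helper, and the contents of both stoplists are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java stand-in for per-field stop filtering; not the Lucene StopFilter API. */
final class PerFieldStopwords {
    // Hypothetical default stoplists for an old and a new version.
    // The words are invented for illustration, not taken from Lucene.
    static final Set<String> OLD_DEFAULTS = new HashSet<>(Arrays.asList("a", "the", "of"));
    static final Set<String> NEW_DEFAULTS = new HashSet<>(Arrays.asList("a", "the", "of", "and"));

    /** Drops every token that appears in the given stop set, preserving order. */
    static List<String> filter(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("war", "and", "peace");
        // The existing field keeps the old list so its clients see unchanged behavior.
        System.out.println("existing field: " + filter(tokens, OLD_DEFAULTS));
        // The new field, queried by different clients, uses the current list.
        System.out.println("new field:      " + filter(tokens, NEW_DEFAULTS));
    }
}
```

The two fields never interact, so nothing in this sketch requires them to agree on a stoplist.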

That's no different than saying I want PorterStemmer on fieldA and SnowBall2Stemmer on fieldB.

The implication I got from Robert was that there were (or would soon be) expectations in Lucene-Java
code that if one object was told to use Version.X it would be assumed that every other object
in the application was using Version.X.

To me that's the crux of the whole issue:  If that _is_ the expectation Lucene-Java has, then
we _should_ have a single global config for luceneMatchVersion and not support per-object
configuration.  If that _is not_ the expectation, then we _should not_ have a global luceneMatchVersion.
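
To make the two alternatives concrete, here is a minimal plain-Java sketch of how a factory might resolve its match version under each one. The MatchVersion enum is a stand-in for o.a.lucene.util.Version, and the class and method names are hypothetical; only the luceneMatchVersion name and the LUCENE_24 default come from the issue itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of the two configuration shapes; not Solr's actual factory code. */
final class MatchVersionConfig {
    /** Stand-in for o.a.lucene.util.Version with only the constants needed here. */
    enum MatchVersion { LUCENE_24, LUCENE_29 }

    /** Global-only: one setting for the whole config; per-factory arguments are never consulted. */
    static MatchVersion resolveGlobal(MatchVersion global, Map<String, String> factoryArgs) {
        return global;
    }

    /** Per-object only: each factory reads its own argument, defaulting to LUCENE_24. */
    static MatchVersion resolvePerObject(Map<String, String> factoryArgs) {
        String value = factoryArgs.get("luceneMatchVersion");
        return value == null ? MatchVersion.LUCENE_24 : MatchVersion.valueOf(value);
    }

    public static void main(String[] args) {
        Map<String, String> factoryArgs = new HashMap<>();
        factoryArgs.put("luceneMatchVersion", "LUCENE_29");
        // Under the global model the factory's own argument is ignored.
        System.out.println(resolveGlobal(MatchVersion.LUCENE_24, factoryArgs));
        // Under the per-object model the factory's own argument wins.
        System.out.println(resolvePerObject(factoryArgs));
    }
}
```

Mixing the two, with a global value that individual factories may override, is exactly the combination the paragraph above argues against.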

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>                 Key: SOLR-1677
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
> with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
> which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
> Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
> the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
> e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
> Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
> for constructing Tokenstreams. The code currently contains a helper map to decode the version
> strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
> of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
> ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
> done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
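
The two decoding approaches mentioned in the description can be sketched in plain Java as follows. This is not the code from the attached patch: the MatchVersion enum stands in for o.a.lucene.util.Version (which per the description is not an enum in 2.9), and the class name, method names, and error message are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of decoding a version string; not the code from SOLR-1677.patch. */
final class VersionDecoding {
    /** Stand-in for o.a.lucene.util.Version; an enum is used here for brevity. */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Helper-map approach: an explicit lookup table from string to constant.
    static final Map<String, MatchVersion> BY_NAME = new HashMap<>();
    static {
        BY_NAME.put("LUCENE_24", MatchVersion.LUCENE_24);
        BY_NAME.put("LUCENE_29", MatchVersion.LUCENE_29);
        BY_NAME.put("LUCENE_30", MatchVersion.LUCENE_30);
    }

    /** Decodes via the lookup table; a missing value falls back to LUCENE_24. */
    static MatchVersion decodeWithMap(String value) {
        if (value == null) {
            return MatchVersion.LUCENE_24;
        }
        MatchVersion version = BY_NAME.get(value);
        if (version == null) {
            throw new IllegalArgumentException("Unknown version: " + value);
        }
        return version;
    }

    /** Decodes via the enum's built-in valueOf; a missing value falls back to LUCENE_24. */
    static MatchVersion decodeWithValueOf(String value) {
        // valueOf throws IllegalArgumentException for an unknown name.
        return value == null ? MatchVersion.LUCENE_24 : MatchVersion.valueOf(value);
    }

    public static void main(String[] args) {
        System.out.println(decodeWithMap("LUCENE_29"));
        System.out.println(decodeWithValueOf(null));
    }
}
```

Both methods reject unknown names with the same exception type, so swapping the map for valueOf would not change what callers see in this sketch.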

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
