lucene-solr-dev mailing list archives

From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Issue Comment Edited: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Tue, 05 Jan 2010 22:29:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12796872#action_12796872 ]

Uwe Schindler edited comment on SOLR-1677 at 1/5/10 10:29 PM:
--------------------------------------------------------------

In my opinion, it should be possible to set a default in solrconfig.xml, because there is currently
no requirement to set a version for all TS components. In the shipped solrconfig.xml, this default
is the version of the shipped Lucene release, so new users can take the default config and extend
it the way all the courses and books about Solr teach. They do not need to care about the version.


If they upgrade their Lucene version, their config stays on the previous setting and
they are fine. If they want to change some of the components (like query parser, index writer,
index reader -- flex!!!), they can do it locally. So Bob could make his change the way Ernest proposed.
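
A minimal sketch of the lookup order described above, assuming a per-component "matchVersion" attribute that overrides a config-wide default. The enum is a stand-in for o.a.lucene.util.Version, and the class and attribute names are hypothetical, not taken from the patch:

```java
import java.util.Map;

// Hypothetical stand-in for org.apache.lucene.util.Version (not the real class).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

// Hypothetical helper: resolves the version a single component should use.
final class MatchVersionResolver {
    private final MatchVersion configDefault;

    MatchVersionResolver(MatchVersion configDefault) {
        this.configDefault = configDefault;
    }

    // A component's own "matchVersion" attribute wins;
    // otherwise the config-wide default applies.
    MatchVersion resolve(Map<String, String> componentArgs) {
        String local = componentArgs.get("matchVersion");
        return local == null ? configDefault : MatchVersion.valueOf(local);
    }
}
```

With this shape, upgrading the Lucene jar does not change any component's behaviour until either the default or a local attribute is edited.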

If we do not have a default, all users will stay stuck on Lucene 2.4, because they do not
care about the version (it is not required, because it defaults to 2.4 for BW compatibility).
So lots of configs will never use the new Unicode features of Lucene 3.1. And suddenly Lucene
4.0 comes out, all support for Lucene < 3 is removed, and then all users cry. With a default
version of 2.4, they will then get a runtime error in Lucene 4.0, saying that Version.LUCENE_24
is no longer available as an enum constant.
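
To illustrate that failure mode, here is a small sketch, assuming the version string from the config is decoded with the enum's valueOf. The enum below is hypothetical and only mimics a future release in which the 2.x constants have been removed:

```java
// Hypothetical enum mimicking a future release without the 2.x constants.
enum FutureVersion { LUCENE_30, LUCENE_31, LUCENE_40 }

final class RemovedConstantDemo {
    private RemovedConstantDemo() {}

    // Returns the message of the failure caused by looking up a removed
    // constant, or null if the lookup unexpectedly succeeds.
    static String lookupFailure(String name) {
        try {
            FutureVersion.valueOf(name);
            return null;
        } catch (IllegalArgumentException e) {
            // Enum.valueOf throws IllegalArgumentException for unknown names.
            return e.getMessage();
        }
    }
}
```

Calling RemovedConstantDemo.lookupFailure("LUCENE_24") returns a non-null message, which is what an old config would hit at startup.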

If you really do not want to have a default version in the config (not the schema, because it applies
to *all* Lucene components), then you should go the same way as Lucene 3.0: require a matchVersion
for all components. As there may be TokenStream components that are not from Lucene, make this attribute
mandatory in the schema only for Lucene streams. This can be done with my initial patch, too:
if the matchVersion property is missing, matchVersion will be null and the factory
should throw an IAE if it requires one. In my original patch, only the parsing code would have to be
moved out of the factory into a util class in Solr. Maybe it could also parse "x.y"-style versions.
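
Roughly what such a util class could look like, as a sketch only. The enum stands in for o.a.lucene.util.Version, and the names MatchVersionParser, parse and require are hypothetical, not the code from the attached patch:

```java
import java.util.Locale;

// Hypothetical stand-in for org.apache.lucene.util.Version.
enum SketchVersion { LUCENE_24, LUCENE_29, LUCENE_30, LUCENE_31 }

// Hypothetical util class holding the parsing code outside the factories.
final class MatchVersionParser {
    private MatchVersionParser() {}

    // Returns null when the attribute is absent, so each factory can decide
    // for itself whether a version is required.
    static SketchVersion parse(String raw) {
        if (raw == null) {
            return null;
        }
        String s = raw.trim().toUpperCase(Locale.ROOT);
        // Accept "x.y" by rewriting it to the constant name: "2.9" -> "LUCENE_29".
        if (s.matches("\\d+\\.\\d+")) {
            s = "LUCENE_" + s.replace(".", "");
        }
        try {
            return SketchVersion.valueOf(s);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown matchVersion: '" + raw + "'", e);
        }
    }

    // For factories that wrap Lucene token streams: a missing version is an error.
    static SketchVersion require(String raw, String factoryName) {
        SketchVersion v = parse(raw);
        if (v == null) {
            throw new IllegalArgumentException(factoryName + " requires a matchVersion attribute");
        }
        return v;
    }
}
```

Factories for non-Lucene token streams would call parse and tolerate null, while Lucene-backed factories would call require and fail with an IAE.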

The problem here: users upgrading from Solr 1.4 will suddenly get errors, because their configs
become invalid. Ahh, and not knowing better, they add LUCENE_29 (how should they know
that Solr 1.4 used Lucene 2.4 compatibility?). And then the mailing list gets flooded with questions,
because suddenly the configs fail to produce results with old indexes.

  
> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / parameter (in 2.9)
for constructing TokenStreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a
Java 5 enum. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

