lucene-solr-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Mon, 21 Dec 2009 14:50:18 GMT

     [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Uwe Schindler updated SOLR-1677:
--------------------------------

    Attachment: SOLR-1677.patch

New patch with some schema and config hacking. Also new test:

- As a first hack the solrConfig schema has a new element <luceneMatchVersion> that
contains a solr-wide default luceneMatchVersion value that is used as default for QueryParser,
Analyzers if not specified different
- On the analyzer side, BaseTokenizerFactory and BaseTokenFilterFactory now extend SolrCoreAware
(and I also allowed these classes to be SolrCoreAware) and get the SolrConfig.
- Both classes now use the default, if not local set as a param (like in the last patch),
but the default is the one got from SolrConfig
- The parser for config strings was moved to Config
- Other components like QueryParserFactories can get the default matchVersion in the same
way
- The default is LUCENE_24 as before.

This is a first idea, how it would work. Open points:
- should the default be in SolrConfig or in IndexConfig?
- I did not change the config.xsd file to reflect my change as open discussion
- all other example config files and schemas should use the default Lucene version shipped
with the solr release (currently 2.9). So user that upgrade get their last lucene version
their index is compatible with, and new users get the latest config.
- If users upgrade the default luceneMatchVersion, they have to possibly reindex (esp. when
upgrading to LUCENE_31 soon, as new Unicode features in all Tokenizers/Filters)

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 is can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Mime
View raw message