lucene-solr-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Wed, 20 Jan 2010 01:51:54 GMT


Robert Muir commented on SOLR-1677:

Hi Hoss Man, 

I think I am slightly offended by some of your statements about 'subjective opinion of the
Lucene Community' and 'they should do relevancy testing which use some language-specific stemmer
whose behavior changed in a small but significant way'.

I've personally restricted my contributions of language support to those I have either personally
relevance tested, or developed from published relevance results. These results are all listed
on each JIRA ticket (MAP values and such). I can give you a list of all these issues if you

As far as changing stemmers, we have never done this.
The only "stemmer changing" I have proposed is fixing bugs, where I have taken the snowball
test data and found either bugs in snowball or duplicate implementations we have in our own
source tree.
And to "fix the bugs" I have only proposed that we simply use snowball itself rather than
some duplicate, buggy hand-coded implementation.

So I'm a little confused about what you are referring to... some theoretical situation?

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>                 Key: SOLR-1677
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
> with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
> which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
> Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
> the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
> e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
> Subclasses can then use the luceneMatchVersion decoded enum (in 3.0) / parameter (in 2.9)
> for constructing TokenStreams. The code currently contains a helper map to decode the version
> strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a
> Java 5 enum there. The default value is Version.LUCENE_24 (as this is the default for the no-version
> ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
> done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
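
For readers unfamiliar with the approach described in the issue text above, here is a rough sketch of how a factory might decode a version string with an enum valueOf lookup and fall back to LUCENE_24. It is not taken from the attached patches. The Version enum below is a local stand-in for o.a.lucene.util.Version so the example compiles without Lucene, and the class name, the parseMatchVersion helper, and the use of "luceneMatchVersion" as the argument key are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MatchVersionSketch {

    // Local stand-in for o.a.lucene.util.Version (assumption: only a few constants shown).
    public enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // The issue text names LUCENE_24 as the default, matching the no-version ctors.
    public static final Version DEFAULT_VERSION = Version.LUCENE_24;

    // Hypothetical helper: reads the version string from the factory arguments.
    // The key name "luceneMatchVersion" is assumed here, not confirmed by the patch.
    public static Version parseMatchVersion(Map<String, String> args) {
        String raw = args.get("luceneMatchVersion");
        if (raw == null || raw.trim().isEmpty()) {
            return DEFAULT_VERSION;
        }
        try {
            // Enum lookup by name, as the 3.0 variant of the patch is described.
            return Version.valueOf(raw.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unknown luceneMatchVersion: " + raw, e);
        }
    }

    public static void main(String[] argv) {
        Map<String, String> args = new HashMap<String, String>();
        System.out.println(parseMatchVersion(args));   // no version configured
        args.put("luceneMatchVersion", "LUCENE_29");
        System.out.println(parseMatchVersion(args));   // explicit version
    }
}
```

A subclass factory would then pass the decoded value to whichever Tokenizer or TokenFilter constructor accepts a Version. Failing loudly on an unknown string, rather than silently using the default, is a design choice of this sketch only.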

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
