lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Wed, 06 Jan 2010 01:21:54 GMT


Hoss Man commented on SOLR-1677:

bq. Version applies to all of lucene (even more than tokenstreams), so for Carl to imply that
you don't need to reindex by bumping Version simply because you aren't using X or Y or Z,
for that he should be renamed Oscar.

Ok, fair enough ... i was supposing in that example that since i called it {{<luceneAnalyzerVersionDefault/>}}
it was clearly specific to analysis objects in schema.xml and didn't affect any of the other
things Version is used for (which would be specified in solrconfig.xml)

bq. i guess he is probably using Windows 3.1 still too because he doesn't want to upgrade

No, he uses an OS where he can upgrade individual things individually with clear implications
-- he sets {{luceneMatchVersion="2.9"}} on each and every {{<analyzer/>}}, {{<tokenizer/>}}
and {{<filter/>}} that he declares in his schema so that he knows exactly what behavior
is changing when he modifies any of them.

bq. personally I don't want all users to be stuck with Version.LUCENE_24 forever. 

I still must be missing something ... why would all users be stuck with Version.LUCENE_24?

I'm not advocating that we don't allow a way to specify Version, i'm saying that having a
global value for it that affects things opaquely sounds dangerous -- we should certainly have
a way for people to specify the Version they want on each of the objects that care, but it
shouldn't be global.  The "luceneMatchVersion" property that Uwe added to BaseTokenizerFactory
and BaseTokenFilterFactory in his patch seems perfect to me, it's just the {{SolrCoreAware}}
/ {{core.getSolrConfig().luceneMatchVersion}} that i think is a bad idea.

If we modify the <analyzer/> initialization to allow constructor args as Erik suggested
(I'm pretty sure there's already code in Solr to do this, we just aren't using it for Analyzers)
then we should be good to go for everything in schema.xml

If anything declared in solrconfig.xml starts caring about Version (QParser, SolrIndexWriter,
etc...) then likewise it should get a "luceneMatchVersion" init property as well.  No one
will ever be "stuck" with LUCENE_24, but they won't be surprised by behavior changes either.
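
A minimal sketch of the per-component approach described above, in Java. It is not the code from the attached patch: {{MatchVersion}} is a hypothetical stand-in for o.a.lucene.util.Version, and {{MatchVersionArgs}} is an invented helper. The only details taken from this thread are the "luceneMatchVersion" property name, the "2.9"-style value, and the LUCENE_24 fallback.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.lucene.util.Version (assumption, not the real class).
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper: each factory resolves the version from its OWN init args,
// never from a global setting.
final class MatchVersionArgs {
    static final String PARAM = "luceneMatchVersion";
    // Back-compatible fallback when the property is absent.
    static final MatchVersion DEFAULT = MatchVersion.LUCENE_24;

    // Helper map that decodes the version strings used in the config.
    private static final Map<String, MatchVersion> BY_LABEL = new HashMap<>();
    static {
        BY_LABEL.put("2.4", MatchVersion.LUCENE_24);
        BY_LABEL.put("2.9", MatchVersion.LUCENE_29);
        BY_LABEL.put("3.0", MatchVersion.LUCENE_30);
    }

    static MatchVersion parse(Map<String, String> initArgs) {
        String raw = initArgs.get(PARAM);
        if (raw == null) {
            return DEFAULT; // property not declared on this component
        }
        MatchVersion version = BY_LABEL.get(raw.trim());
        if (version == null) {
            throw new IllegalArgumentException(
                "Unknown " + PARAM + " '" + raw + "', expected one of " + BY_LABEL.keySet());
        }
        return version;
    }
}
```

Because the value is read from the component's own arguments, changing it on one {{<filter/>}} cannot silently change the behavior of any other declaration.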

bq. If we do not have a default, all users will keep stuck with lucene 2.4, because they do
not care about version (it is not required, because it defaults to 2.4 for BW compatibility).
So lots of configs will never use the new unicode features of Lucene 3.1.

I don't believe that.  Almost every solr user on the planet starts with the example configs.
 if the example configs start specifying "luceneMatchVersion=2.9" on every analyzer and factory
then people will care about Version just as much as they care about the stopwords.txt file
that ships with solr -- that may be not at all, or it may be a lot, but it will be up to them,
and it will be obvious to them, because it's right there in the declaration where they can
see it, and easy for them to reference and recognize that changing that value will affect things.

bq. If you really do not want to have a default version in config (not schema, because it
applies to all lucene components), then you should go the way like Lucene 3.0: Require a matchVersion
for all components.

I'm totally on board with that idea in the long run -- but there are ways to get there gradually
that are back compatible with existing configs.  Individual factories that care about luceneMatchVersion
should absolutely start warning on startup that newer/better behavior may be available when
luceneMatchVersion is unset (or doesn't match the current value of Version.LUCENE_CURRENT),
and provide a URL for a wiki page somewhere where more detail is available.  The Analyzer init
code can do likewise if it sees an {{<analyzer class=.../>}} being inited w/ a constructor
that takes in a "Version" which is using an "old" value.
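
The startup check could look roughly like the following Java sketch. Everything in it is illustrative: {{LuceneVersion}} stands in for o.a.lucene.util.Version, {{NEWEST}} plays the role of Version.LUCENE_CURRENT, {{MatchVersionAdvisor}} is an invented name, and the wiki URL is left as a placeholder because no such page is named here.

```java
import java.util.Optional;

// Hypothetical stand-in for org.apache.lucene.util.Version, declared oldest to newest.
enum LuceneVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper a factory (or the Analyzer init code) could call on startup.
final class MatchVersionAdvisor {
    // Assumption: plays the role of Version.LUCENE_CURRENT.
    static final LuceneVersion NEWEST = LuceneVersion.LUCENE_30;
    // Placeholder only; the real wiki page is not specified.
    static final String HELP_URL = "<wiki page URL>";

    // Returns a warning message when the component should nudge the user, else empty.
    static Optional<String> startupWarning(String component, LuceneVersion configured) {
        if (configured == null) {
            return Optional.of(component + ": luceneMatchVersion is not set; "
                + "newer behavior may be available, see " + HELP_URL);
        }
        if (configured.compareTo(NEWEST) < 0) { // enum order doubles as version order
            return Optional.of(component + ": luceneMatchVersion=" + configured
                + " is older than " + NEWEST + "; newer behavior may be available, see "
                + HELP_URL);
        }
        return Optional.empty();
    }
}
```

The check only warns and never changes behavior, so existing configs keep working exactly as before while users are pointed at the setting.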

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>                 Key: SOLR-1677
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
