lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Mon, 04 Jan 2010 05:50:54 GMT


Hoss Man commented on SOLR-1677:

bq. The problem is the default value. If you leave out the version parameter instance-wise,
you will get 2.4. And because of that all solr users will get stuck with that version and
will never upgrade (because they leave the default and do not specify a different value).

That feels like a misleading statement ... the "Version" property on these objects is really
more about getting the "recommended" behavior as of a particular version of Lucene ... saying
that users will be "stuck with that version" is like saying users will be "stuck with StandardAnalyzer"
instead of getting "NewHotnessAnalyzer" because they have to edit their config to use the
newer/better analyzer -- Lucene-Java has opted to use a Version property on existing classes
instead of adding new classes, but it's still conceptually the same thing: they get the behavior
they've always gotten, unless they change their config to get something different.

Besides which: 99.9% of Solr users copy the example config when they first start using Solr:
we can set a "version" property on every Analyzer/Factory used in the example schema.xml and
update them all when we upgrade the Lucene jars just as easily as we can update a single "global"
value (it's a search+replaceAll instead of a search+replace)

bq. Why are you so against a default value? 

My concern is that it introduces action at a distance -- and not in a good way.

Here's the scenario that seems guaranteed to happen quite a bit if we add some new {{<luceneAnalyzerVersionDefault/>}}
syntax to schema.xml...


{{<luceneAnalyzerVersionDefault>2.9</luceneAnalyzerVersionDefault>}} is added
to the example schema.xml, and users start using it as a result of copying/modifying the example
configs.  Time passes, new bugs are fixed, and the example configs evolve to contain {{<luceneAnalyzerVersionDefault>3.4</luceneAnalyzerVersionDefault>}}

A little while after that, User Bob emails solr-user with a question like...

Hey, I'm using FooTokenFilterFactory and i noticed that at query time i see behaviorX when
it really seems like i should see BehaviorY 

User Carl helpfully replies...

That was identified as a bug with FooTokenFilter that was fixed in Lucene 3.1, but the default
behavior was left as is for backcompatibility.  If you change your {{<luceneAnalyzerVersionDefault/>}}
value to 3.1 (or 3.2) you'll get the newer/better behavior -- but if you used FooTokenFilterFactory
in an _index_ analyzer you'll need to reindex.

Bob makes the change to 3.2 that Carl recommended, and is happy to see that his queries now work.
 He only uses FooTokenFilterFactory at _query_ time, so he doesn't bother to reindex, and
everything seems fine.

What Bob doesn't realize (and what Carl wasn't aware of) is that elsewhere in his schema.xml
file, Bob is also using the YakTokenizerFactory on a different field (yakField), and the behavior
of the YakTokenizer changed in Lucene 3.0. Now _some_ documents/queries that use yakField
are failing -- and *failing silently.*
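To make the failure mode concrete, here is a minimal sketch of how a schema-wide default could be resolved per factory. Everything in it is an assumption for illustration: the class and method names are hypothetical, versions are modeled as plain strings, and nothing here is taken from the attached patch or from Solr's code. A {{null}} version stands for "no version set on this factory".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model, not Solr code: each factory either pins its own
// version or (null) falls back to the schema-wide default.
class GlobalVersionSketch {

    // Version a single factory would actually run with.
    static String effectiveVersion(String explicitVersion, String globalDefault) {
        return explicitVersion != null ? explicitVersion : globalDefault;
    }

    // Resolve every factory in the schema against one global default.
    static Map<String, String> resolveAll(Map<String, String> factories, String globalDefault) {
        Map<String, String> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : factories.entrySet()) {
            resolved.put(e.getKey(), effectiveVersion(e.getValue(), globalDefault));
        }
        return resolved;
    }

    public static void main(String[] args) {
        Map<String, String> bobsSchema = new LinkedHashMap<>();
        bobsSchema.put("FooTokenFilterFactory", null); // relies on the global default
        bobsSchema.put("YakTokenizerFactory", null);   // relies on the global default

        // Raising the global default moves BOTH factories, including the one
        // Bob never meant to touch.
        System.out.println(resolveAll(bobsSchema, "2.9"));
        System.out.println(resolveAll(bobsSchema, "3.2"));

        // Pinning the version on the one factory Bob cares about leaves the
        // other factory on the old default.
        bobsSchema.put("FooTokenFilterFactory", "3.2");
        System.out.println(resolveAll(bobsSchema, "2.9"));
    }
}
```

In this model, the only change that affects FooTokenFilterFactory without also moving YakTokenizerFactory is the explicit, per-factory one.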


Things just get a lot simpler when all of the configuration for an Analyzer, TokenizerFactory,
or Tokenizer is explicit in its declaration -- indirect initialization is fine, as long
as it's obvious.  Ie: <field/> declarations referencing fieldTypes by name -- It's easy
to fuck up a bunch of fields by making a single change to one fieldType, but at least you
can grep for the name of the fieldType to see all the fields you are affecting.  

Even if "Carl" knows/remembers to warn "Bob" that changing {{<luceneAnalyzerVersionDefault/>}}
might change/break other things in his schema.xml, the situation doesn't get much better: unless
Bob (or Carl) skims the code for every Analyzer, Tokenizer, and TokenFilter used in Bob's schema,
they can't be sure what might get affected by making a small increase to the "global" luceneAnalyzerVersion
setting ... which means the only safe thing for Bob to do is to set the property individually
in the one place he really wants to make the change.

So why have the "global" in the first place?  It really just seems like more trouble than
it's worth.

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>                 Key: SOLR-1677
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
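
The helper-map decoding described in the issue could look roughly like the sketch below. It is not the code from the attached patch: the enum is a local stand-in for o.a.lucene.util.Version listing only a few constants, the accepted string forms ("2.4", "2.9", "3.0") are an assumption, and the class and method names are hypothetical. The only details taken from the description are the map-based lookup and the LUCENE_24 default.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of string-to-version decoding with a default.
class VersionDecodeSketch {

    // Local stand-in for org.apache.lucene.util.Version (subset only).
    enum Version { LUCENE_24, LUCENE_29, LUCENE_30 }

    // Default used when a factory declares no version, per the issue text.
    static final Version DEFAULT_VERSION = Version.LUCENE_24;

    // Helper map from configured strings to constants; the string forms
    // are assumed, not taken from the patch.
    private static final Map<String, Version> BY_STRING = new HashMap<>();
    static {
        BY_STRING.put("2.4", Version.LUCENE_24);
        BY_STRING.put("2.9", Version.LUCENE_29);
        BY_STRING.put("3.0", Version.LUCENE_30);
    }

    // Decode a configured value; null means "not specified".
    static Version decode(String configured) {
        if (configured == null) {
            return DEFAULT_VERSION;
        }
        Version v = BY_STRING.get(configured.trim());
        if (v == null) {
            throw new IllegalArgumentException("Unknown version: " + configured);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(decode(null));  // falls back to the default
        System.out.println(decode("2.9")); // explicit value wins
    }
}
```

Rejecting unknown strings instead of silently falling back to the default is a design choice of this sketch; it keeps a typo in the config from quietly selecting the oldest behavior.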

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
