lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Mon, 11 Jan 2010 23:12:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798916#action_12798916 ]

Hoss Man commented on SOLR-1677:
--------------------------------

bq. I don't think Version is intended so you can use X.Y on this part and Y.Z on this part
and have any chance of anything working, for example it controls position increments on stopfilter
but also in queryparser, if you use wacky combinations, things might not work.

How is that any different from letting users pass any Analyzer they want to the QueryParser
constructor?  There's no guarantee that anything will ever work if you do something crazy
(like uppercase all terms when indexing, and lowercase all terms when searching).  But Lucene
exposes that to the developer and lets them make the choice -- likewise Solr happily lets
you configure a query analyzer that's completely different from your index analyzer -- if
that's what you want, that's what you get: being able to set different Version params should
be no different.  If the QueryParser you are using says that version=X.Y will only work with
StopFilter if it's version=X.Y as well, that's fine -- but maybe you've solved that problem
a completely different way with a completely alternate implementation of StopFilter (that
doesn't care about version).  The user should be in control.

bq. sometimes things interact in ways we cannot detect automatically

which is why i think it's a bad idea to have a global default for this ... there may be situations
where people explicitly want different behavior in different instances (ie: in this field
i want the legacy 2.4 StopFilter behavior, but in this field i want the current 2.9 stop filter
behavior) and having a default will mask the ability to do this, and make it easy to inadvertently
break it.

bq. its my understanding that things like this are why Version was created in the first place.

My understanding is vastly different from yours ... All the discussions i remember about it
were along the lines of preventing Class proliferation -- that people didn't like the idea
of creating StandardAnalyzer2 just because StandardAnalyzer had some behavior that was considered
buggy but couldn't be removed -- so now there is a constructor arg instead, and static constants
that let you pick a fixed behavior, or a constant that lets you pick "current" no matter
what it is -- so applications that always want the "current recommended behavior" can just
upgrade a jar and get it.

But I don't remember any implication that it was expected that every object would have the
same Version settings as every other object -- if that was the intention then shouldn't there
be a standard interface for "Versionable" or "VersionAware" objects so they can test compatibility
with one another (ie: QueryParser and Analyzers that might wrap StopFilter) ? ... or a {{public
static void setCurrentOperatingVersion(Version)}} method in the Version class, instead of
letting each constructor take in an independent value?

----

FWIW: Even though I'm still convinced that having any sort of "global" default value for luceneMatchVersion
is a bad idea -- and i'm going to keep trying to convince other people as well -- I want to
make some comments about how i think it should be implemented if we do wind up doing it (just
in case i get hit by a bus)

Making the Base*Factory analysis classes SolrCoreAware is really overkill for this -- there
was a real conscious choice not to let things declared in schema.xml be SolrCoreAware, because
it pulls back the curtain and exposes a lot of plumbing related APIs in a way that could make
it hard to refactor away SolrCore functionality later.  The list of plugin types that can
be made SolrCoreAware is deliberately small, and confined to plugins that are already exposed
to the full SolrCore API at some other time in their life cycle -- being SolrCoreAware just
gives them access to the core during initialization.

If there is really going to be one uber-default global "luceneMatchVersion" then i think the
place it makes the most sense to declare something like this is in the schema.xml -- many
different solrconfig.xml files might be used with the same schema.xml, so if we're expecting
that the "typical" behavior is to set this once and have it just work, it should propagate
from the IndexSchema object to the SolrCore and not vice-versa.

My suggestion for how to implement this would be...

# Add a new "luceneMatchVersion" attribute to the existing <schema/> tag.
# Add a new getLuceneMatchVersion() to the IndexSchema class ... SolrCore can use this to
get the default.
# When init()ing new objects, add the key=>value pair of {{"luceneMatchVersion"=>schema.getLuceneMatchVersion()}}
to the args passed to the object's init method if it's not already an init param for that particular instance.
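
A minimal sketch of step 3, assuming init params arrive as a plain {{Map<String,String>}}. The {{InitArgDefaults}} class and {{withSchemaDefault}} method are hypothetical names used only for illustration, not existing Solr APIs; the point is just that the schema-level value is added only when the instance did not declare its own.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of Solr) illustrating the defaulting rule. */
final class InitArgDefaults {
    static final String KEY = "luceneMatchVersion";

    /**
     * Returns a copy of the instance's init args. The schema-level default
     * is added only when the instance has no luceneMatchVersion of its own.
     */
    static Map<String, String> withSchemaDefault(Map<String, String> initArgs,
                                                 String schemaDefault) {
        Map<String, String> merged = new HashMap<>(initArgs);
        if (schemaDefault != null) {
            // An explicit per-instance value always wins over the default.
            merged.putIfAbsent(KEY, schemaDefault);
        }
        return merged;
    }
}
```

With this shape, a factory declared with its own luceneMatchVersion keeps that value, and every other factory sees the value from the <schema/> tag.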

This would eliminate the need to make any of the Analysis Factories SolrCoreAware (or even
ResourceLoaderAware) just to know what the luceneMatchVersion should be -- the Base*Factories
could still contain a {{protected Version luceneMatchVersion}} set by the base init() method
that subclasses could use as needed.
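
Roughly what that base class could look like, as a sketch only: {{MatchVersion}} is a stand-in enum so the example compiles without Lucene, and {{VersionedFactorySketch}} / {{StopFilterFactorySketch}} are hypothetical names, not the real Base*Factory classes. The LUCENE_24 fallback follows the default mentioned in the issue description; the position-increment check is only an illustration of a subclass consulting the field.

```java
import java.util.Map;

/** Stand-in for org.apache.lucene.util.Version (assumption, for illustration). */
enum MatchVersion { LUCENE_24, LUCENE_29 }

/** Hypothetical base factory: init() decodes the version once for subclasses. */
abstract class VersionedFactorySketch {
    protected MatchVersion luceneMatchVersion;

    public void init(Map<String, String> args) {
        String raw = args.get("luceneMatchVersion");
        // Fall back to LUCENE_24 when nothing was configured or defaulted.
        luceneMatchVersion = (raw == null)
                ? MatchVersion.LUCENE_24
                : MatchVersion.valueOf(raw);
    }
}

/** Hypothetical subclass that reads the inherited field as needed. */
class StopFilterFactorySketch extends VersionedFactorySketch {
    boolean enablePositionIncrements() {
        // Illustrative rule: newer behavior from LUCENE_29 onward.
        return luceneMatchVersion.compareTo(MatchVersion.LUCENE_29) >= 0;
    }
}
```

The subclass never touches SolrCore or a ResourceLoader; everything it needs arrives through the init args.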

NOTE: This still doesn't solve the "Analyzers must have no-arg constructors" part
of the issue -- but it doesn't make it worse.  We can make IndexSchema pass this.getLuceneMatchVersion()
to any Analyzer with a single arg "Version" constructor fairly easily.  If/When we provide
a more general mechanism for passing constructor args to Analyzers, any Version params could
be defaulted just like with the factory init() methods.
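
One way the single-arg constructor lookup might work, sketched with plain reflection. {{SketchVersion}}, {{NoArgAnalyzer}}, {{VersionedAnalyzer}} and {{AnalyzerLoaderSketch}} are all hypothetical stand-ins so the example runs without Lucene or Solr; the real IndexSchema code may look quite different.

```java
/** Stand-in for org.apache.lucene.util.Version (assumption, for illustration). */
enum SketchVersion { LUCENE_24, LUCENE_29 }

/** Hypothetical analyzer with only a no-arg constructor. */
class NoArgAnalyzer { }

/** Hypothetical analyzer with a single-arg "Version" constructor. */
class VersionedAnalyzer {
    final SketchVersion version;
    VersionedAnalyzer(SketchVersion version) { this.version = version; }
}

/** Hypothetical loader: prefer the Version constructor, else fall back to no-arg. */
final class AnalyzerLoaderSketch {
    static Object newAnalyzer(Class<?> clazz, SketchVersion schemaDefault) {
        try {
            try {
                // Pass the schema-level version when the class accepts one.
                return clazz.getDeclaredConstructor(SketchVersion.class)
                            .newInstance(schemaDefault);
            } catch (NoSuchMethodException e) {
                // No Version constructor: keep today's no-arg behavior.
                return clazz.getDeclaredConstructor().newInstance();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + clazz.getName(), e);
        }
    }
}
```

Analyzers that only have a no-arg constructor keep working unchanged, which is why this doesn't make the existing limitation any worse.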

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.


