lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Fri, 15 Jan 2010 00:44:54 GMT


Hoss Man commented on SOLR-1677:

bq. And I also can't see anyone really spending time to aggressively ensure that the example
schema etc is all up to date

I think you are vastly underestimating how much work is spent reviewing the example schema.xml
prior to releases.  It would be trivial to search/replace luceneMatchVersion="X" with luceneMatchVersion="Y"
anytime the "current" value of Version was updated in Lucene-Java.

bq. the hardcoded 2.4 behavior is the action at a distance, because if i do not specify Version
in my configuration file, then i get this very old behavior.

I don't follow you at all -- you have identified no action, or distance in your example.

When i say i'm worried about scary action at a distance, i'm talking about editing some thing
A in a config file, and having it result in changed behavior (action) in things B, C and D
that do not directly refer to A in any way (distance).  Furthermore, these changes in behavior
are silent (thus scary).

If I have {{<fieldType name="A"/>}} and much later in the config {{<field name="B"
type="A"/>}} then editing A results in an action on B at a distance -- but this should
not surprise me at all because B explicitly references A.

Having a global {{<luceneMatchVersion/>}} tag that affects the behavior of a variety
of different things when it's modified leads to situations where people might change that
value triggering changes in many components w/o a clear idea of what might have changed --
so they don't even know what things they should focus on testing for correctness after making
that change.

The existing {{<schema version="X"/>}} property also leads to action at a distance type
situations -- but that is a lot less scary to me because at least with it there is a uniform
set of changes to *all* schema objects between any two versions, so it's easy to document
what changes when you go from 1.1 to 1.2, or 1.2 to 1.3 ... but with luceneMatchVersion the
potential changes are unique to every individual Class that cares about it.

If this is really your concern, then i have an alternative i propose.

* No default anywhere, not even in the code
* Version is mandatory if the thing requires it

This is something Uwe and i both discussed in previous comments... i said: i'm fine with
this idea in theory -- as a long term plan -- but there has to be a gradual migration process
for people. ie: it can be required on certain objects in a future release, but for at least
the next release it needs to be possible to not specify the luceneMatchVersion on all of these
objects, and when people use them w/o specifying, they can log big fat warnings on init that
it is defaulting to 2.4, and that they should set the property explicitly if that's what they want.
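
The migration behavior described above could look roughly like the following. This is only a sketch under stated assumptions: {{MatchVersion}} is a hypothetical stand-in for o.a.lucene.util.Version (so the example has no Lucene dependency), and {{VersionResolver}} is an invented helper, not existing Solr API. Only the {{luceneMatchVersion}} argument name and the 2.4 default come from this discussion.

```java
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical stand-in for o.a.lucene.util.Version.
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Hypothetical helper; not part of Solr.
final class VersionResolver {
  private static final Logger LOG = Logger.getLogger(VersionResolver.class.getName());

  // Fallback used when a factory is configured without luceneMatchVersion.
  static final MatchVersion LEGACY_DEFAULT = MatchVersion.LUCENE_24;

  private VersionResolver() {}

  // Returns the configured version, or the legacy default plus a loud warning.
  static MatchVersion resolve(String factoryName, Map<String, String> initArgs) {
    String raw = initArgs.get("luceneMatchVersion");
    if (raw == null) {
      LOG.warning(factoryName + ": no luceneMatchVersion specified, defaulting to "
          + LEGACY_DEFAULT + "; set it explicitly if that is what you want");
      return LEGACY_DEFAULT;
    }
    // Enum.valueOf throws IllegalArgumentException for unknown names,
    // so a typo in the config fails loudly instead of silently defaulting.
    return MatchVersion.valueOf(raw.trim());
  }
}
```

A factory's init could call {{VersionResolver.resolve("MyFactory", args)}} once and keep the result. In a later release the warning branch could be turned into an exception to make the property mandatory.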


bq. I still do not want it in schema.xml, as Version is a global Lucene thing!

Uwe: I think you are misunderstanding the reason for a distinction between solrconfig.xml
and schema.xml in Solr.  If (for the sake of argument) luceneMatchVersion really should be
a "global Lucene thing" then that is precisely why it should be in schema.xml.

schema.xml is for configuration that is inherently part of the index, and must be consistent
regardless of who/how/why that index is being used.  solrconfig.xml is where settings are
put that are specific to how a particular instance of an index is being used.   If a setting
is in solrconfig.xml, then it should be possible for that setting to be completely different
on different solr instances that use the exact same schema.xml -- even if they use cloned
copies of the same index directory. (ie: master/slave distinctions in replication; peer slaves
with distinct handler/cache settings to serve distinct use cases; etc...)

That's the reason why nothing that hangs off of IndexSchema is currently allowed to be SolrCoreAware,
or get access to the SolrConfig object (and the SolrResourceLoader abstraction was created)
... nothing about the SolrCore "instance" should be allowed to influence the resulting index,
because that index may later be used on a different instance with a different config.

As i mentioned before: solrconfig.xml can depend on schema.xml, but schema.xml can not depend
on solrconfig.xml.

So if a global luceneMatchVersion can affect the behavior of an analyzer or FieldType in a
way that is "persisted" as part of the index -- and other classes (like QueryParser in Robert's
example) need to make sure to use the same luceneMatchVersion to behave correctly with that
index, then that setting needs to be in the schema.xml so it is consistent no matter how/where
that index and schema.xml file are used.

Does that make sense?


I'd still like to clarify this whole issue of whether "Lucene-Java", as a project, has an expectation
that client applications will always use a consistent value for Version when constructing
objects that interact with an index, as Robert alluded to in a previous comment...

bq. I don't think Version is intended so you can use X.Y on this part and Y.Z on this part

This was not my impression when Version was added -- but i freely admit I wasn't paying that
much attention.

In Uwe's comment he implied (but didn't actually state) that he concurred with Robert...

bq. ...Version is a global Lucene thing...

*Iff* that expectation really is true in Lucene-Java, and *iff* there really is an expectation
that using multiple Version values within Solr is likely to cause people problems as objects
interact, then it seems to me that it would be a very bad idea to offer any sort of out of the
box support for per object overriding of luceneMatchVersion in our solrconfig.xml/schema.xml.

i know, i know ... this is a complete 180 from my previous claim that we should _only_ have
per object configuration -- a claim that i still stand behind if Lucene-Java "supports" applications
using multiple values of Version, but if that is not considered "supported" and if changes
are actively being made in Lucene-Java that explicitly assume consistent Version usage, then
I'm not convinced it would be a good idea to enable people to tweak things in that way.  Anyone
who understands the underlying Java code enough to appreciate the nuances of using A.B in
one place and X.Y in another place can write their own Factory that looks at a luceneMatchVersion
init param -- the out of the box ones should stick with the global setting.
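
To make that split concrete, here is a rough sketch of the two kinds of factory. Everything in it is an assumption for illustration: {{MatchVersion}} is a hypothetical stand-in for o.a.lucene.util.Version, and {{GlobalVersionFactory}} / {{OverridableVersionFactory}} are invented names, not the actual BaseTokenizerFactory or BaseTokenFilterFactory API.

```java
import java.util.Map;

// Hypothetical stand-in for o.a.lucene.util.Version.
enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

// Sketch of an out-of-the-box factory: it only ever uses the global setting.
class GlobalVersionFactory {
  protected MatchVersion version;

  void init(MatchVersion globalVersion, Map<String, String> initArgs) {
    // Per-object init args are deliberately ignored for the version.
    this.version = globalVersion;
  }

  MatchVersion getVersion() {
    return version;
  }
}

// Sketch of a user-written factory that opts in to a per-object override.
class OverridableVersionFactory extends GlobalVersionFactory {
  @Override
  void init(MatchVersion globalVersion, Map<String, String> initArgs) {
    super.init(globalVersion, initArgs);
    String override = initArgs.get("luceneMatchVersion");
    if (override != null) {
      // The author of this class takes responsibility for mixing versions.
      this.version = MatchVersion.valueOf(override.trim());
    }
  }
}
```

With a global value of LUCENE_30 and an init param of LUCENE_24, the first factory would report LUCENE_30 and the second LUCENE_24, so only code someone wrote on purpose can diverge from the global setting.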

BUT!!!!! ... those are Big "IFFs" ... 

* Uwe: do you concur with Robert?
* Are there any threads/docs about the expectations of Version homo/hetero-geneity in Lucene-Java?

> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>                 Key: SOLR-1677
>                 URL:
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing Tokenstreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as the Version is a subclass
of Java5 enums. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
