lucene-solr-dev mailing list archives

From "Hoss Man (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1677) Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
Date Wed, 20 Jan 2010 01:36:54 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802609#action_12802609 ]

Hoss Man commented on SOLR-1677:
--------------------------------

I'm definitely of two minds on this.

On the one hand...

Robert's clarification of his concerns convinces me that we don't need a global setting. 
The issue of multiple related components in an analysis chain (i.e., EsperantoTokenizer, EsperantoStopFilter,
and EsperantoStemmerFilter) not being well tested in Lucene-Java when those components use
different Version properties doesn't seem like a compelling argument because we've never made
any claims that any combination of analysis components will work together.  People can easily
construct Analyzers in their schema.xml that make no sense and don't work at all; we'll never
be able to solve that problem for everyone.   Worrying about people mismatching version
numbers doesn't seem any different than worrying about them using inconsistent stopword files
between an index analyzer and a query analyzer on the same field: buyer beware.

On the other hand...

I view the Version property of all these Lucene-Java classes as an implementation detail of
the generalized ideal of providing multiple solutions for a similar problem that have subtly
different behavior.  To my mind: adding a Version property to StandardTokenizer is just an
alternate approach to deprecating StandardTokenizer and providing a new StandardTokenizer2
where the behavior is "improved" based on the subjective opinion of the Lucene community.
 The Version property approach is easier to maintain in the Lucene source tree, but still
requires roughly the same amount of work on the part of client app maintainers when upgrading:
consider whether you think the "improved" behavior is better for your application, and modify
your code as needed.  I've been looking at how this should be supported in Solr with that
perspective, putting the schema.xml owner in the role of the client app maintainer.

But I'm realizing now that I'm clearly in the minority in viewing these multiple versions
as "alternate implementations" ... everyone else seems to have a very fixed view that these
Version-based changes are genuine improvements/bug-fixes, w/o any expectation that clients
might/could subjectively decide "I want the old behavior", and that older "Versions" are supported
purely for back-compatibility.

If that's how Version is really going to be used in Lucene-Java moving forward, then I can
definitely understand the push for having it globally configured in Solr for simplification.

----

I won't fight you guys on this ... if I'm the only one that feels like a global value is bad,
then I concede that probably says more about me than about the idea.

But I'm still really worried about the problem of (opaque) action at a distance, and the difficulties
in understanding what effects there will be when changing the luceneMatchVersion property
from one value to another.

This comment from Mark illustrates what scares me the most...

bq. it should say, if you change this, you must reindex. No worries about action at a distance.
The action is to get the latest and greatest Lucene has to offer rather than older buggy or
back compat behavior.

...that mindset, that as long as you reindex you'll be fine, totally downplays the fact that
changes will happen in places the user may not realize.  W/o a clear way of knowing what exactly
is changing when you modify that (global) value, users will have no idea what to look for
when they "upgrade" it.  They won't have any visibility into the full set of behavior
changes to expect as a result of that update, so they won't know what they should test to make sure
it still works the way they need it to.

If they read in a mailing list thread that they need to switch from {{<luceneMatchVersion>2.4</luceneMatchVersion>}}
to {{<luceneMatchVersion>2.9</luceneMatchVersion>}} and completely reindex in
order to get positions to be preserved in StopFilterFactory, that doesn't help them realize
that they should do relevancy testing on fieldA and fieldB, which use some language-specific
stemmer whose behavior changed in a small but significant way.

As a user, that's the nightmare scenario I don't want to have to deal with:  grepping through
every class in Lucene-Java that has a Version property to see which ones have different behavior
between the luceneMatchVersion value I'm currently using and the luceneMatchVersion value
I've been told I should upgrade to in order to fix a bug ... just so I know what things I
need to test after I make my change.

I guess this will just be a documentation problem, but it seems like a pretty fucking big
one.
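
To make the documentation concern concrete, here is a minimal sketch of the kind of lookup a user
would want: given the current and target luceneMatchVersion, list the components whose behavior
changes in between. Everything in it is hypothetical: the MatchVersion enum is a stand-in for
o.a.lucene.util.Version, the Change record and changesBetween method do not exist in Lucene or
Solr, and the three entries are illustrative, taken only from the examples mentioned in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BehaviorChangeSketch {
    /** Stand-in for org.apache.lucene.util.Version (assumption, not the real class). */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Hypothetical record of one version-dependent behavior change. */
    static final class Change {
        final String component;
        final MatchVersion since;
        final String note;

        Change(String component, MatchVersion since, String note) {
            this.component = component;
            this.since = since;
            this.note = note;
        }

        @Override
        public String toString() {
            return component + " (since " + since + "): " + note;
        }
    }

    /** Illustrative entries only; a real list would have to be maintained per release. */
    static final List<Change> CHANGES = Arrays.asList(
        new Change("StandardTokenizer", MatchVersion.LUCENE_24, "posIncr and host token types"),
        new Change("StandardTokenizer", MatchVersion.LUCENE_29, "posIncr and host token types"),
        new Change("StopFilter", MatchVersion.LUCENE_29, "positions preserved"));

    /** Returns changes introduced after 'from' and up to and including 'to'. */
    static List<Change> changesBetween(MatchVersion from, MatchVersion to) {
        List<Change> result = new ArrayList<Change>();
        for (Change c : CHANGES) {
            if (c.since.compareTo(from) > 0 && c.since.compareTo(to) <= 0) {
                result.add(c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Lists what a user moving from 2.4 to 2.9 would need to re-test.
        for (Change c : changesBetween(MatchVersion.LUCENE_24, MatchVersion.LUCENE_29)) {
            System.out.println(c);
        }
    }
}
```

Without some maintained list like this, the only source of truth is the code of each class that
takes a Version.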



> Add support for o.a.lucene.util.Version for BaseTokenizerFactory and BaseTokenFilterFactory
> -------------------------------------------------------------------------------------------
>
>                 Key: SOLR-1677
>                 URL: https://issues.apache.org/jira/browse/SOLR-1677
>             Project: Solr
>          Issue Type: Sub-task
>          Components: Schema and Analysis
>            Reporter: Uwe Schindler
>         Attachments: SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch, SOLR-1677.patch
>
>
> Since Lucene 2.9, a lot of analyzers use a Version constant to keep backwards compatibility
with old indexes created using older versions of Lucene. The most important example is StandardTokenizer,
which changed its behaviour with posIncr and incorrect host token types in 2.4 and also in
2.9.
> In Lucene 3.0 this matchVersion ctor parameter is mandatory and in 3.1, with much more
Unicode support, almost every Tokenizer/TokenFilter needs this Version parameter. In 2.9,
the deprecated old ctors without Version take LUCENE_24 as default to mimic the old behaviour,
e.g. in StandardTokenizer.
> This patch adds basic support for the Lucene Version property to the base factories.
Subclasses then can use the luceneMatchVersion decoded enum (in 3.0) / Parameter (in 2.9)
for constructing TokenStreams. The code currently contains a helper map to decode the version
strings, but in 3.0 it can be replaced by Version.valueOf(String), as Version is a
Java 5 enum. The default value is Version.LUCENE_24 (as this is the default for the no-version
ctors in Lucene).
> This patch also removes unneeded conversions to CharArraySet from StopFilterFactory (now
done by Lucene since 2.9). The generics are also fixed to match Lucene 3.0.
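
The version-string decoding described above might look roughly like the following sketch. It is
not the code from the attached patch: the MatchVersion enum is a stand-in for
o.a.lucene.util.Version so the example compiles without Lucene, and the decode method, the map
contents, and the accepted spellings ("2.9" as well as "LUCENE_29") are assumptions. Only the
LUCENE_24 default comes from the description.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class VersionDecoderSketch {
    /** Stand-in for org.apache.lucene.util.Version (assumption, not the real class). */
    enum MatchVersion { LUCENE_24, LUCENE_29, LUCENE_30 }

    /** Helper map from configuration strings to version constants. */
    private static final Map<String, MatchVersion> BY_NAME;
    static {
        Map<String, MatchVersion> m = new HashMap<String, MatchVersion>();
        m.put("2.4", MatchVersion.LUCENE_24);
        m.put("2.9", MatchVersion.LUCENE_29);
        m.put("3.0", MatchVersion.LUCENE_30);
        // Also accept the constant names themselves, e.g. "LUCENE_29".
        for (MatchVersion v : MatchVersion.values()) {
            m.put(v.name(), v);
        }
        BY_NAME = Collections.unmodifiableMap(m);
    }

    /** Decodes a configured version string; a missing value falls back to LUCENE_24. */
    static MatchVersion decode(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return MatchVersion.LUCENE_24;
        }
        MatchVersion v = BY_NAME.get(raw.trim().toUpperCase(Locale.ROOT));
        if (v == null) {
            throw new IllegalArgumentException("Unknown version: " + raw);
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(decode("2.9"));  // prints LUCENE_29
        System.out.println(decode(null));   // prints LUCENE_24
    }
}
```

Failing loudly on an unknown string, rather than silently falling back to the default, is a
design choice of this sketch: a typo in the configured value would otherwise change analysis
behavior without any warning.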

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

