lucene-dev mailing list archives

From Grant Ingersoll
Subject Re: Lucene's default settings & back compatibility
Date Tue, 19 May 2009 12:56:08 GMT

On May 19, 2009, at 8:19 AM, Michael McCandless wrote:

> On Tue, May 19, 2009 at 7:26 AM, Grant Ingersoll wrote:
>> I don't think we have said that bug fixes are required to be back
>> compatible, even if it is in analysis.  I think it is a really bad idea
>> for TokenStreams to have if clauses in them checking boolean values for
>> old versus new behaviors.
>> When they can be back compat, we do, but there is not a requirement.
>> For instance, we upgraded Snowball.
> True (Snowball), but then we have discussions like this:

> #action_12550948
> which added a confusing deprecated "boolean replaceDepAcronym =
> false;" to StandardAnalyzer.  Something similar led to
> StandardAnalyzer.replaceInvalidAcronym.
> I think there have been other cases (in particular StandardAnalyzer,
> QueryParser) over time, but I haven't tracked them down.  Analyzer
> back compat after fixing issues is especially tricky since the bugs
> get "cached" into the index and queries against that index using the
> fixed analyzer may no longer match the docs.  (So I think back-compat
> is important in Analyzers).
>> Or, the removal of StopFilter as "Standard" altogether.  This, coupled
>> with a QP that created phrases around stop words, is a better solution.
> Interesting... that'd be a pretty big change to StandardAnalyzer,
> though.
> I can see we are spinning off lots of neat ideas, decoupled from the
> "Settings" proposal, here :)
>> For instance, if we removed the StopFilter from the StandardAnalyzer,
>> then what?  A Settings object would not be able to account for it.
> Why not?  The settings object could have say a property
> "analysis.standard.enableStopFilter"?

And what if it is something that has to be called in the next() chain  
and not during construction?  Are you going to want to call that every  
single time over millions upon millions of tokens in a large  
collection?   Even if it is during construction, you still might end  
up calling it a lot of times.
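
To make the cost concern concrete, here is a minimal sketch, not Lucene code. `FlagSettings`, `PerTokenFilter`, `HoistedFilter` and the key "analysis.lowercase" are all hypothetical names invented for illustration. It contrasts looking a setting up on every token with reading it once at construction:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the proposed Settings object; not a Lucene class.
final class FlagSettings {
    private final Map<String, Boolean> flags = new HashMap<>();
    int lookups = 0; // counts how often any flag is read

    FlagSettings set(String key, boolean value) {
        flags.put(key, value);
        return this;
    }

    boolean get(String key, boolean defaultValue) {
        lookups++;
        Boolean v = flags.get(key);
        return v != null ? v : defaultValue;
    }
}

// Consults the settings object for every token passing through next().
final class PerTokenFilter {
    private final FlagSettings settings;

    PerTokenFilter(FlagSettings settings) {
        this.settings = settings;
    }

    String next(String token) {
        return settings.get("analysis.lowercase", true)
                ? token.toLowerCase(Locale.ROOT)
                : token;
    }
}

// Reads the setting once, at construction, and keeps it in a final field.
final class HoistedFilter {
    private final boolean lowercase;

    HoistedFilter(FlagSettings settings) {
        this.lowercase = settings.get("analysis.lowercase", true);
    }

    String next(String token) {
        return lowercase ? token.toLowerCase(Locale.ROOT) : token;
    }
}

class FilterLookupDemo {
    public static void main(String[] args) {
        FlagSettings a = new FlagSettings().set("analysis.lowercase", true);
        PerTokenFilter perToken = new PerTokenFilter(a);
        for (int i = 0; i < 1000; i++) {
            perToken.next("Token");
        }
        System.out.println("per-token lookups: " + a.lookups);

        FlagSettings b = new FlagSettings().set("analysis.lowercase", true);
        HoistedFilter hoisted = new HoistedFilter(b);
        for (int i = 0; i < 1000; i++) {
            hoisted.next("Token");
        }
        System.out.println("hoisted lookups: " + b.lookups);
    }
}
```

Hoisting removes the repeated lookup, but the old-versus-new branch still runs once per token, and a filter constructed once per document still pays the lookup that many times.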

>> Likewise, the subtler issue of "fixing" a TokenStream such that it
>> might produce different tokens.
> Settings should cover this in general, I think.
>> I really worry about Settings objects having to be repeatedly checked
>> inside of tight inner loops.  Even looking at the new TokenStream stuff,
>> there are now checks for the "new API" in an area that is called _a lot_
>> of times.
> Agreed, but I'd say this is orthogonal.  We should never do slow
> things inside inner loops -- checking settings, calling logging
> frameworks, calling List.size(), opening files, etc.  This is the
> stuff of standard coding practices...

There's a difference between std. coding practices and purposefully  
putting in lots of if checks to solve back compatibility issues that  
are created in order to satisfy some naming convention.  Given the  
length of time between releases, we could easily call every new  
release a major version and we wouldn't be all that different from  
most commercial projects.  I'd bet if we switched from calling things  
major.minor and just called them Lucene '09 and Lucene '10 people  
would be just fine with the changes.

I've said it before and I'll say it again.  Given the time between  
Lucene releases (at least 6 mos. for minor releases and 1+ year for  
majors) we have _PLENTY_ of time to let users know what is coming and  
plan accordingly.   By being so dogmatic about back compatibility, I  
believe we are making it harder to innovate and harder for new people  
to contribute and we keep cruft around for way too long.  (How the  
heck is a new contributor supposed to keep track of all the things  
that went into Lucene for the past 1.5 years?)  I'm not saying we  
should throw back compat. out the window, I'm just saying we should  
take it more on a case by case basis, with the default, obviously,  
being to favor back compatibility.  The large majority of users  (I'd  
venture to say well north of 95% of them) will be able to deal with  
minor API changes every 6 to 8 months, especially if we are more  
proactive about communicating them to java-user@ and in CHANGES.  In  
fact, if we announced breaking changes not for the next version but  
for the one after, it would give people lots of time to adapt.

>> Last, and mostly I mention it as an afterthought.  How are you going to
>> handle changes to the Settings?  Say, for instance, we come out w/
>> Settings2.4, release it and then we realize we missed something (and
>> this seems likely given the number of settings available in Lucene),
>> then what?
>> We deprecate Settings2.4 and come out with TheRealSettingsFor2.4?  And
>> then when that is incomplete?
> Well, in 2.9 there would still be a Settings2.4 class, but it'd have
> newly created (in 2.9) settings with their defaults bound.
> So in 2.9, when sorting by field you can optionally turn off scoring.
> It gives a sizable performance boost doing so.  We of course were
> forced to leave scoring on for back compat, but if we had this
> Settings class online what we would have done instead is add a new
> "search.sort.trackScores" (and, "trackMaxScore") setting to the base
> Settings class, but the Settings2.4 would bind it to true.
> There should be no need to make a new class for 2.4's settings on
> releasing 2.9?

I think you missed the point.  The problem arises when we release 2.4's  
settings and they turn out to be wrong.  Using your example, say  
Settings24 was messed up and set trackMaxScore to true when it should  
have been false (mistakes happen).  It gets released in 2.9 as the  
settings for 2.4 back compatibility.  We then realize our mistake.   
How do you fix it?  You can't just set it to false, b/c now you have  
users who are depending, potentially, on the _wrong_ version.  So, now  
you have to deprecate it and come out with a "new" Settings2.4 called  
something else.
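
A minimal sketch of that scenario follows. None of these classes exist in Lucene: `Settings`, `Settings24` and `Settings24Fixed` are hypothetical names, and only the two setting names come from the example quoted above.

```java
// Hypothetical sketch of the proposal under discussion; not a Lucene API.
class Settings {
    // Newest defaults: scoring is switched off when sorting by field.
    boolean trackScores() { return false; }
    boolean trackMaxScore() { return false; }
}

// Snapshot released in 2.9 to reproduce 2.4 behavior.
class Settings24 extends Settings {
    @Override boolean trackScores() { return true; }

    // Suppose this binding is the mistake: it should have been false.
    // Once released, flipping it in place silently changes behavior for
    // anyone who already depends on this class.
    @Override boolean trackMaxScore() { return true; }
}

// The compatible way out: leave Settings24 alone (deprecated) and ship
// the corrected bindings under a new name.
class Settings24Fixed extends Settings24 {
    @Override boolean trackMaxScore() { return false; }
}

class SettingsVersionDemo {
    public static void main(String[] args) {
        Settings released = new Settings24();
        Settings corrected = new Settings24Fixed();
        System.out.println("Settings24.trackMaxScore = " + released.trackMaxScore());
        System.out.println("Settings24Fixed.trackMaxScore = " + corrected.trackMaxScore());
    }
}
```

Each correction of this kind adds another class that has to be carried and documented, which is the cruft problem all over again.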

>> I still think we would benefit from just communicating upcoming changes
>> better even in minor releases, thereby allowing for a bit more variance
>> in back compat.  It should be the exception, not the rule.
> I like DM's point, that this Settings class would be a great vehicle
> for exactly that communication.  Rather than poring over a
> CHANGES.txt, you can see setting-by-setting what changed, and why.

Sorry, I'd rather read CHANGES.  It is the one place we all make sure  
to enter our changes.  People aren't as good about javadocs,  
especially accessors where the name is "self explanatory".  Plus it  
has a link to a JIRA issue.

Also, how useful is it going to be to have 30 or 40 (hundreds?)  
accessors on a single Settings object?  So, then, the logical thing to  
do is to split it up and have some nested way of doing things.  And  
then people will be tired of having to programmatically set all the  
values, so they will create a config/properties file that does it.   
But, because we don't like dependencies, we will re-invent how that  
works.  After it's all said and done, you end up having re-invented IOC.
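
As a rough illustration of where that road leads, here is a sketch of the properties-file stage. `PropertiesSettings` is a hypothetical class, not anything in Lucene; the two keys are the ones floated earlier in this thread, and `java.util.Properties` is the standard JDK class.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical settings holder backed by dotted property keys.
class PropertiesSettings {
    private final Properties props = new Properties();

    PropertiesSettings(String text) {
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
    }

    boolean getBoolean(String key, boolean defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Boolean.parseBoolean(raw.trim());
    }
}

class PropertiesSettingsDemo {
    public static void main(String[] args) {
        String config =
                "analysis.standard.enableStopFilter=false\n"
              + "search.sort.trackScores=true\n";
        PropertiesSettings settings = new PropertiesSettings(config);
        System.out.println(settings.getBoolean("analysis.standard.enableStopFilter", true));
        System.out.println(settings.getBoolean("search.sort.trackScores", false));
    }
}
```

Once keys are nested and loaded from a file like this, the next requests are typed values, validation and object wiring, which is what existing IOC containers already provide.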

Another interesting thing to think about is how do we sunset old  
settings objects.  When we are on 4.X, should we still keep around 2.4  
settings?  Not really something we necessarily need to solve right now.

