lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: deprecating Versions
Date Mon, 29 Nov 2010 19:14:53 GMT
On 11/29/2010 01:03 PM, Robert Muir wrote:
> On Mon, Nov 29, 2010 at 12:51 PM, DM Smith<dmsmith555@gmail.com>  wrote:
>> I'd have to look to be sure: IIRC, Turkish was one. The treatment of 'i' was
>> buggy. Russian had its own encoding that was replaced with UTF-8. The
>> QueryParser had bug fixes. There is some effort to migrate away from stemmer
>> to snowball, but at least the Dutch one is not "identical".
>>
> but none of these broke backwards compatibility, they all respect the
> Version constant!
> The SnowballAnalyzer respects the version constant for the buggy
> Turkish lowercasing! If you use Version.LUCENE_30 (or less) it wrongly
> lowercases so you get your old buggy behavior.
>
> Even the old buggy Dutch stemmer is still there, and if you use
> DutchAnalyzer(Version.LUCENE_30) (or less) it stems incorrectly so you
> get your old buggy behavior!
>
> The Russian was the same way, same with the QueryParser.
>
> So I'm sorry, I am left confused about where the backwards breaks are?
Strictly speaking there are none at present. A Lucene user can choose
either to break compatibility or to retain the old (and in these cases,
buggy) behavior. This maintains Lucene's bw-compat policy.

This thread talked about removing the Version constants in the future.
I went back and re-read the thread; perhaps I misunderstood. I saw
several thoughts:
  * Deprecate version constants one version back and remove those two
    versions back.
  * Remove all version constants and use versioned jars instead.

If there is no way to select a prior behavior except to select a single 
jar that had lots of analyzers (or analyzer parts) in it, then I'm stuck 
with older code that is perhaps buggy. I can't pick a later analyzer for 
English and an earlier, buggy analyzer for Turkish. I have to get all of 
them from one jar. (Unless we get into renaming packages and/or 
classes). So I can't get some improvements while ignoring others.
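
To make the per-analyzer choice concrete, here is a minimal Java sketch.
It does not use Lucene: CompatVersion, turkishLowercase and
englishLowercase are hypothetical names standing in for the Version
constants and the analyzers, and the old Turkish behavior is modeled
simply as locale-blind lowercasing. It only illustrates that each call
site can pick its own version, which a single versioned jar would not
allow.

```java
import java.util.Locale;

// Hypothetical stand-in for Lucene's Version constants; not the real class.
enum CompatVersion { V30, V31 }

final class PerAnalyzerVersionSketch {

    // Hypothetical Turkish lowercasing. V30 keeps the old locale-blind
    // behavior (capital I becomes dotted i); V31 applies Turkish casing
    // rules (capital I becomes dotless i, U+0131).
    static String turkishLowercase(String text, CompatVersion version) {
        Locale locale = (version == CompatVersion.V30)
                ? Locale.ROOT
                : Locale.forLanguageTag("tr");
        return text.toLowerCase(locale);
    }

    // Hypothetical English lowercasing. No version difference is modeled
    // here; the parameter only shows that the choice is made per analyzer.
    static String englishLowercase(String text, CompatVersion version) {
        return text.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // Newest behavior for English, old behavior for Turkish.
        System.out.println(englishLowercase("Index", CompatVersion.V31));
        System.out.println(turkishLowercase("ILIK", CompatVersion.V30));
        // The fixed Turkish behavior yields a different term for the same input.
        System.out.println(turkishLowercase("ILIK", CompatVersion.V31));
    }
}
```

With one jar per version, both calls would have to come from the same
jar, so the mix in main() would not be possible without renaming
packages or classes.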

I think deprecating and removing constants is a problem too. Trunk,
which will become 4.0, needs to be able to read and/or upgrade 2.x
indexes. From an analyzer's perspective, an index is invalid if the
analyzer would produce a different token stream for the same input. If
the 2.x version constants are gone, then an index built with them is no
longer valid. (It might be valid, but how can one have any confidence
of that?) Upgrading the index to the new internal format cannot change
this: a buggy lowercased Turkish word will still be buggy after the
upgrade. (That behavior is tied to a 3.0 version constant, which will
still need to be around in 5.0.)
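
A rough Java sketch of why the index stops being trustworthy once the
old behavior can no longer be selected. Again this is not Lucene code:
the "index" is just a set of analyzed terms, and OLD and FIXED are
hypothetical analyzers that differ only in how they lowercase a capital
I, which is my assumption about the shape of the Turkish bug.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

final class IndexAnalyzerMismatchSketch {

    // Hypothetical old analyzer: locale-blind lowercasing.
    static final UnaryOperator<String> OLD =
            token -> token.toLowerCase(Locale.ROOT);

    // Hypothetical fixed analyzer: Turkish-aware lowercasing.
    static final UnaryOperator<String> FIXED =
            token -> token.toLowerCase(Locale.forLanguageTag("tr"));

    // Toy index: the set of terms the analyzer produced at index time.
    static Set<String> buildIndex(String[] tokens, UnaryOperator<String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String token : tokens) {
            terms.add(analyzer.apply(token));
        }
        return terms;
    }

    // A query matches only if its analyzed form equals a stored term.
    static boolean matches(Set<String> index, String query, UnaryOperator<String> analyzer) {
        return index.contains(analyzer.apply(query));
    }

    public static void main(String[] args) {
        // Index written with the old behavior.
        Set<String> index = buildIndex(new String[] {"ILIK"}, OLD);

        // Same analyzer at query time: the term is found.
        System.out.println(matches(index, "ILIK", OLD));   // true

        // Only the fixed analyzer is available: same input, no match.
        System.out.println(matches(index, "ILIK", FIXED)); // false
    }
}
```

Rewriting the stored terms into a new file format would not change the
second result; only re-analyzing the original text, or keeping the old
behavior selectable, would.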

We either need more frequent releases (forcing the issue earlier and 
eliminating stale code earlier) or something's gotta give.

That said, as a user, I don't care any more. I'll give. The benefit of a
better index outweighs backward compatibility for me.

-- DM

