lucene-dev mailing list archives

From DM Smith <>
Subject Re: deprecating Versions
Date Mon, 29 Nov 2010 22:22:52 GMT
On 11/29/2010 03:43 PM, Earwin Burrfoot wrote:
> On Mon, Nov 29, 2010 at 20:51, DM Smith<>  wrote:
>> The other thing I'd like is for the spec to be saved alongside the index
>> as a manifest. From earlier threads, I can see that there might need to be
>> one for writing and another for reading. I'm not interested in using it to
>> construct an analyzer, but to determine whether the index is invalid wrt
>> the analyzer currently in use.
> You can already implement such behaviour with 3.x branch of Lucene.
> It has IW.commit(Map<String, String>  userdata) method, that allows you
> to commit with arbitrary payload, that binds to segment and can be
> read back later.

Cool. I forgot entirely about that.
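
A rough sketch of the manifest check I have in mind, in plain Java. The map stands in for the user data handed to IW.commit(Map<String, String>); the class name, key names and version strings here are hypothetical, not anything Lucene defines, and no Lucene classes are used so the snippet stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical helper: describes the analyzer configuration as a String map
// that could be stored with a commit and compared against later.
final class AnalyzerManifest {
    // Assumed key names, chosen for illustration only.
    static final String ANALYZER_KEY = "analyzer.class";
    static final String VERSION_KEY = "analyzer.matchVersion";

    // Build the manifest that would be committed alongside the index.
    static Map<String, String> describe(String analyzerClass, String matchVersion) {
        Map<String, String> manifest = new HashMap<>();
        manifest.put(ANALYZER_KEY, analyzerClass);
        manifest.put(VERSION_KEY, matchVersion);
        return manifest;
    }

    // The index is treated as valid only if both entries match exactly.
    // A missing manifest gives no confidence, so it counts as incompatible.
    static boolean isCompatible(Map<String, String> stored, Map<String, String> current) {
        if (stored == null || current == null) {
            return false;
        }
        return Objects.equals(stored.get(ANALYZER_KEY), current.get(ANALYZER_KEY))
            && Objects.equals(stored.get(VERSION_KEY), current.get(VERSION_KEY));
    }
}
```

At write time the result of describe(...) would go into the commit; at open time the stored map would be read back and passed to isCompatible(...) together with the description of the analyzer currently in use. Exact equality is the strictest possible rule; a real check might accept versions known to produce the same token stream.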

>> I think there is a problem with deprecating and removing constants too.
>> In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
>> indexes. From an analyzer perspective, an index is invalid if the analyzer
>> would produce a different token stream for the same input. If the 2.x
>> version constants are gone, then the index built with 2.x version
>> constants is no longer valid. (It might be valid, but how can one have any
>> confidence of that?) Upgrading the index to the new internal format
>> cannot change this. A buggy lowercase Turkish word will still be buggy
>> after upgrade. (This is a 3.0 version constant that in 5.0 will still need to be
> I think it was declared that Lucene does not provide index
> compatibility across more than a single major revision.
> Thus, we don't guarantee reading 2.x index with 4.0 Lucene. So, we can
> drop 2.x constants and compatibility.
> But we still have to support 3.x. In version 5.0 then we're dropping
> 3.x constants and support for bugs/deprecated
> features of 3.x.

Yes, you are correct that 4.0 may, but is not guaranteed to, read 2.x. My 
bad, yet again. I went back to the threads regarding this from around May 25, 
and it was also decided that 4.x might not be able to read 3.x, but that a 
migration tool will be provided in such a case.

That said, my point still stands. The 3.0 version constant which is used 
by an analyzer to preserve 3.0 behavior will need to be retained for the 
sake of analyzers in 5.0. Or the index will need to be rebuilt from 
original input. (I'm referencing the 3.0 rather than a 2.x because of 
the example I have in mind)

A 3.0 index that is migrated to a 4.0 index still holds tokens produced 
by an analyzer that was buggy. Example: a Turkish index with the wrong 
lower case i (Prior to LUCENE-2101, it would lowercase to i. After: 
İ (dotted capital I) => i ("regular" lower case i) and I 
("regular" upper case I) => ı (dotless lower case i)). These letters occur 
very commonly in Turkish text. So the 4.0 index, still using the 3.0 
version constant to get the expected behavior, works as it always did.

Now in 5.0, there might be a migration tool, or it will be able to read a 
4.x index. If the 3.0 constant is gone, none of these tokens are 
reachable: search requests will have the correct lower case i and will 
not be able to find those with the wrong one. It will be very obvious.
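
To make the mismatch concrete, here is a small sketch using only the JDK. String.toLowerCase with a locale stands in for the analyzer's lowercasing; it is not Lucene's filter code, and the class and method names are made up for the illustration. The index-time method plays the role of the old behavior (I => i), the query-time method the corrected Turkish behavior (I => ı).

```java
import java.util.Locale;

// Hypothetical demo: an index built with the old lowercasing rule cannot be
// searched with terms produced by the corrected Turkish rule.
final class TurkishCaseMismatch {
    static final Locale TURKISH = Locale.forLanguageTag("tr-TR");

    // Stand-in for the old analyzer behavior: locale-insensitive lowercasing.
    static String indexTimeToken(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Stand-in for the corrected behavior: Turkish casing rules.
    static String queryTimeToken(String text) {
        return text.toLowerCase(TURKISH);
    }

    // A term lookup is an exact match on the token text.
    static boolean matches(String indexedToken, String queryToken) {
        return indexedToken.equals(queryToken);
    }

    public static void main(String[] args) {
        String word = "ISPARTA";
        String indexed = indexTimeToken(word); // "isparta"
        String query = queryTimeToken(word);   // "\u0131sparta" (dotless i)
        System.out.println(indexed + " vs " + query + " -> match: " + matches(indexed, query));
    }
}
```

The same input yields two different tokens, so the term written at index time is never hit by the query. That is the situation an index ends up in once the constant that preserved the old behavior is gone.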

Regarding this analyzer, code that uses a 2.x version constant for this 
analyzer will need to change to a 3.0 version constant in order for the 
index to be usable in the 4.x series if the 2.x constants are removed.

I don't think this is an isolated example.

As things stand, anyone whose index uses a deprecated version 
constant will have one very long major release cycle in which to rebuild 
that index from scratch.

And as I said at the bottom of my last email, I'm going to re-index 
because I am able and because I want correct behavior. So whatever is 
decided won't affect my application of Lucene.

-- DM
