lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: deprecating Versions
Date Mon, 29 Nov 2010 14:05:05 GMT

On Nov 29, 2010, at 5:34 AM, Robert Muir wrote:

> On Mon, Nov 29, 2010 at 2:50 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>> And for indexes:
>> * Index compatibility is guaranteed across two adjacent major
>> releases. eg 2.x -> 3.x, 3.x -> 4.x.
>>  That includes both binary compat - codecs, and semantic compat -
>> analyzers (if appropriate Version is used).
>> * Older releases are most probably unsupported.
>>  e.g. 4.x still supports shared docstores for reading, though never
>> writes them. 5.x won't read them either, so you'll have to at least
>> fully optimize your 3.x indexes when going through 4.x to 5.x.
>> 
> 
> Is it somehow possible i could convince everyone that all the
> analyzers we provide are simply examples?

It really doesn't solve the problem. Analyzers are not much more than a tokenizer and zero or
more filters chained in a fixed order. Right now, the "more" is the special code regarding
reuse.
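
To make the "tokenizer plus ordered filters" picture concrete, here is a minimal sketch in plain
Java. It deliberately does not use Lucene's API: ChainSketch, tokenize, lowerCase, stop and the
two filter lists are hypothetical names invented for illustration, and the stop-word table is a
toy. The point is only that the same components in a different order yield different tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical stand-in for an analyzer; none of this is Lucene's API.
final class ChainSketch {
    // Toy stop-word table (an assumption for the example).
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of"));

    // Tokenizer: split on runs of whitespace.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Filter: lower-case every token.
    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // Filter: drop stop words. The lookup is case-sensitive on purpose.
    static List<String> stop(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!STOP_WORDS.contains(t)) out.add(t);
        }
        return out;
    }

    static final List<Function<List<String>, List<String>>> LOWER_THEN_STOP =
            Arrays.asList(ChainSketch::lowerCase, ChainSketch::stop);
    static final List<Function<List<String>, List<String>>> STOP_THEN_LOWER =
            Arrays.asList(ChainSketch::stop, ChainSketch::lowerCase);

    // The "analyzer": run the tokenizer, then apply each filter in list order.
    static List<String> analyze(String text, List<Function<List<String>, List<String>>> filters) {
        List<String> tokens = tokenize(text);
        for (Function<List<String>, List<String>> f : filters) {
            tokens = f.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Art of War", LOWER_THEN_STOP)); // [art, war]
        System.out.println(analyze("The Art of War", STOP_THEN_LOWER)); // [the, art, war]
    }
}
```

With the stop filter first, "The" survives because the table only holds lower-case entries, so a
query analyzed one way will not match text indexed the other way.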

In my project, I don't use any of the Analyzers that Lucene provides, but I have variants
of them. (Mine take flags indicating whether to filter stop words and whether to do
stemming.) The effort recently has been to change these analyzers to follow the new reuse
pattern to improve performance.

Had there been a declarative mechanism, I wouldn't have needed to make the changes.
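
As a rough illustration of what "declarative" could mean here, the sketch below describes an
analyzer as data (an ordered, comma-separated list of filter names) and leaves the wiring to
shared code. This is an assumption about the shape of such a mechanism, not an existing Lucene
feature; DeclarativeSketch, the FILTERS registry and the spec syntax are all made up for the
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical declarative analyzer description; not an existing Lucene feature.
final class DeclarativeSketch {
    // Registry mapping a filter name used in a spec to its implementation.
    static final Map<String, UnaryOperator<List<String>>> FILTERS = new HashMap<>();
    static {
        FILTERS.put("lowercase", DeclarativeSketch::lowerCase);
        FILTERS.put("stop", DeclarativeSketch::dropStopWords);
    }

    static List<String> lowerCase(List<String> in) {
        List<String> out = new ArrayList<>();
        for (String t : in) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    static List<String> dropStopWords(List<String> in) {
        List<String> stops = Arrays.asList("the", "a", "of"); // toy table
        List<String> out = new ArrayList<>();
        for (String t : in) {
            if (!stops.contains(t)) out.add(t);
        }
        return out;
    }

    // spec is an ordered, comma-separated list of filter names, e.g. "lowercase,stop".
    static List<String> analyze(String spec, String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        for (String name : spec.split(",")) {
            String key = name.trim();
            if (key.isEmpty()) continue; // an empty spec means "tokenize only"
            UnaryOperator<List<String>> filter = FILTERS.get(key);
            if (filter == null) {
                throw new IllegalArgumentException("unknown filter: " + key);
            }
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("lowercase,stop", "The Art of War")); // [art, war]
        System.out.println(analyze("lowercase", "The Art of War"));      // [the, art, of, war]
    }
}
```

Under this arrangement the stop-word and stemming flags become entries in the spec, and a change
such as a new reuse pattern would be made once in the shared wiring rather than in every variant.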

WRT an analyzer, if any of the following changes, all bets are off:
    Tokenizer (i.e. which tokenizer is used)
    The rules that a tokenizer uses to break into tokens. (E.g. query parser, break iterator,
...)
    The type associated with each token (e.g. word, number, url, .... )
    Presence/Absence of a particular filter
    Order of filters
    Tables that a filter uses
    Rules that a filter encodes
    The version and implementation of Unicode being used (whether via ICU, Lucene and/or Java)
    Bugs fixed in these components.
(This list is adapted from an email I wrote to a user's group explaining why texts need to
be re-indexed.)
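
One way an application could notice such changes is to store a fingerprint of the analyzer
configuration next to the index and compare it on open. The sketch below is only an assumption
about how that might look: FingerprintSketch, its parameters and the hand-maintained "revision"
string (meant to cover bug fixes and Unicode version changes that the other fields cannot see)
are hypothetical. It catches only what the description actually records.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: hashes a description of the analysis chain.
final class FingerprintSketch {
    static String fingerprint(String tokenizer, List<String> filtersInOrder,
                              List<String> stopTable, String revision) {
        StringBuilder desc = new StringBuilder();
        desc.append("tokenizer=").append(tokenizer).append('\n');
        // Filter order matters, so it is kept as given.
        desc.append("filters=").append(String.join(">", filtersInOrder)).append('\n');
        // Table entries are sorted so that listing order does not change the hash.
        desc.append("stops=").append(String.join(",", new TreeSet<String>(stopTable))).append('\n');
        // Bumped by hand when a component's rules or bugs change.
        desc.append("revision=").append(revision).append('\n');
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(desc.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // The index needs rebuilding when the stored and current fingerprints differ.
    static boolean needsReindex(String stored, String current) {
        return !stored.equals(current);
    }

    public static void main(String[] args) {
        String stored = fingerprint("whitespace", Arrays.asList("lowercase", "stop"),
                Arrays.asList("the", "a", "of"), "1");
        String reordered = fingerprint("whitespace", Arrays.asList("stop", "lowercase"),
                Arrays.asList("the", "a", "of"), "1");
        System.out.println(needsReindex(stored, stored));    // false
        System.out.println(needsReindex(stored, reordered)); // true
    }
}
```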

Additionally, it is the user's responsibility to normalize the text, probably to NFC or NFKC,
before indexing and searching. (Normalization may need to precede the Tokenizer if the Tokenizer
is not Unicode aware. E.g. what does a LetterTokenizer do if input is NFD and it encounters an
accent?)
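
The NFD question can be tried out with the JDK alone. The sketch below uses java.text.Normalizer
and a rough imitation of a letter-based tokenizer that splits wherever Character.isLetter returns
false. It is not Lucene's LetterTokenizer, whose behavior may differ; NormalizeSketch and
letterTokens are hypothetical names.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

// Rough imitation of a letter-based tokenizer; not Lucene's LetterTokenizer.
final class NormalizeSketch {
    // Emits maximal runs of code points for which Character.isLetter is true.
    static List<String> letterTokens(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                out.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) out.add(current.toString());
        return out;
    }

    public static void main(String[] args) {
        // NFD input: 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 's'.
        String nfd = "cafe\u0301s";
        String nfc = Normalizer.normalize(nfd, Normalizer.Form.NFC);

        // The combining mark is not a letter, so the word is cut in two.
        System.out.println(letterTokens(nfd)); // [cafe, s]
        // After NFC the accent is part of a single letter, U+00E9: one token.
        System.out.println(letterTokens(nfc).size()); // 1
    }
}
```

The same text therefore produces different tokens depending on its normalization form, which is
why the normalization step has to be identical at index time and at search time.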

Recently, we've seen some mistrust here that JVMs at the same version level from different
vendors (Sun, Harmony, IBM) produce the same results. (IIRC: Thai break iterator.
Random tests.)

For the most part, searching the index will seem to be fine. It may only be edge cases that
cause problems.

Adding documents to an index with a changed Analyzer might not be a good thing. It might result
in a question of "Why does my search find this Document, but not that Document? Both should
be returned."

Within a release of Lucene, a small handful of analyzers may have changed sufficiently to
warrant re-index of indexes built with them.

For me the bigger problem is that the parts of an analyzer are not separately versioned. It is
not simply a matter of using a lucene-analyzers-XX.YY.jar. That is too coarse grained. Each
release has new goodness regarding analysis of non-English texts and performance regarding
all texts. If I want any or all of that, I have three choices:
a) Upgrade and rebuild every index. Since the desktop application does not know if a change
requires rebuild, everything must be rebuilt.
or
b) Fork all the components I use. (To me this is just wrong, but perhaps necessary/expedient.)
or
c) Version the names of the packages and/or classes. (I don't like this idea either, but it
works.)

Given that the releases of Lucene and my application are infrequent (so much for the "release
often" mantra), forcing a rebuild is not such a horrible thing for me.

So basically, I have given up on Lucene being backward compatible where it matters the most
to me: Stable analyzer components. The gain I get from this admission is far better. YMMV.

Hope this helps,
	DM

