On 04/15/2010 09:49 AM, Robert Muir wrote:
wrong, it doesn't fix the analyzers problem.

you need to reindex.

On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
On Thu, Apr 15, 2010 at 17:17, Yonik Seeley <yonik@lucidimagination.com> wrote:
> Seamless online upgrades have their place too... say you are upgrading
> one server at a time in a cluster.

Nothing here that can't be solved with an upgrade tool. Down one
server, upgrade index, upgrade software, up.

Having read the thread, I have a few comments; much of what follows is summary.

The current proposal requires a re-index on every upgrade to Lucene. Plain and simple.

Robert is right about the analyzers.

There are three levels of backward compatibility, though we usually talk about only two.

First, the index format. IMHO, it is a good thing for a major release to be able to read the prior major release's index. And the ability to convert it to the current format via optimize is also good. Whatever is decided on this thread should take this seriously.

Second, the API. The current mechanism of using deprecations to migrate users to a new API is both a blessing and a curse. It is a blessing to end users because it gives them a clear migration path. It is a curse to development because the API is bloated with both the old and the new. Further, it causes unfortunate class naming, with a tendency to migrate away from the good name. It is also a curse to end users because it can cause confusion.

While I like the mechanism of deprecations to migrate me from one release to another, I'd be open to another mechanism. So much effort is put into API back-compat that it might be better spent on something else, e.g. thorough documentation.

Third, the behavior. WRT analyzers (consisting of tokenizers, stemmers, stop words, ...): if the token stream changes, the index is no longer valid. It may appear to work, but it is broken. The token stream applies not only to the indexed documents but also to the user-supplied query. A simple example: if from one release to the next 'a' is dropped from the default stop word list, then phrase searches including 'a' won't work against an old index, because 'a' was never indexed. Even a simple, obvious bug fix that changes the stream is bad.
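To make the failure mode concrete, here is a toy sketch of the stop word scenario. It is not the Lucene API; the analyze/phraseMatch methods are hypothetical stand-ins for an analyzer chain and a phrase query, just to show how an index built with one stop word list silently breaks against queries analyzed with another.

```java
import java.util.*;

public class StopWordDemo {
    // Toy "analyzer": lowercase, split on whitespace, drop stop words.
    // A stand-in for a real Lucene analyzer chain, not the actual API.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("\\s+")) {
            if (!stopWords.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    // Naive phrase match: query tokens must appear consecutively.
    static boolean phraseMatch(List<String> indexed, List<String> query) {
        return Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        String doc = "once upon a time";
        String phrase = "upon a time";

        // "Old" release: 'a' is a stop word at both index and query time.
        Set<String> oldStops = new HashSet<>(Arrays.asList("a", "the"));
        List<String> indexed = analyze(doc, oldStops);

        // Both sides dropped 'a' consistently, so the phrase matches.
        System.out.println(phraseMatch(indexed, analyze(phrase, oldStops)));  // true

        // "New" release: 'a' is no longer a stop word, but the index is old.
        // The query keeps 'a', the index never had it: the phrase fails.
        Set<String> newStops = new HashSet<>(Arrays.asList("the"));
        System.out.println(phraseMatch(indexed, analyze(phrase, newStops)));  // false
    }
}
```

The same mismatch arises from any token stream change, not just stop words: stemming tweaks, tokenization fixes, and Unicode differences all leave the old index and the new query-time analysis disagreeing.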

Another behavior change is an upgrade in Java version. When Lucene 3 forced users to move to Java 5, the version of Unicode changed with it. This by itself changes some token streams.

With a change to a token stream, the index must be re-created to ensure expected behavior. If the original input is no longer available, or the index cannot be rebuilt for whatever reason, then Lucene should not be upgraded.

It is my observation, though possibly not correct, that core has only rudimentary analysis capabilities, handling English very well. To handle other languages well, "contrib/analyzers" is required. Until recently it did not get much love, and there have been many compatibility-breaking changes (though with Version one can probably get the prior behavior). IMHO, most of contrib/analyzers should be in core. My guess is that most non-trivial applications use contrib/analyzers.

The other problem I have is the assumption that re-indexing is feasible and that indexes are always server based. Re-index feasibility has already been well discussed on this thread from a server-side perspective. But there are many client-side applications, like mine, where the index is built and used on the user's computer. In my scenario the user builds indexes individually for books: the sentence is the Lucene document and the book is the index. Building an index is voluntary and takes time proportional to the size of the book and inversely proportional to the power of the computer. Our user base is people with ancient, underpowered laptops in third-world countries. On those machines it might take 10 minutes to create an index, and during that time the machine is fairly unresponsive. There is no opportunity to "do it in the background."

So what are my choices? (Rhetorical.) With each new release of my app, I'd like to exploit the latest and greatest features of Lucene. I'm also going to change my app in ways that may or may not be related to Lucene, and those latter features are what matter most to my user base. They don't care what technologies are used to do searches. If the latest Lucene jar does not let me use Version (or some other mechanism) to maintain compatibility with an older index, the user will have to re-index. Or I can forgo any future Lucene upgrades. Neither is very palatable.

-- DM Smith