lucene-dev mailing list archives

From DM Smith <dmsmith...@gmail.com>
Subject Re: Proposal about Version API "relaxation"
Date Thu, 15 Apr 2010 17:30:38 GMT
On 04/15/2010 09:49 AM, Robert Muir wrote:
> wrong, it doesn't fix the analyzers problem.
>
> you need to reindex.
>
> On Thu, Apr 15, 2010 at 9:39 AM, Earwin Burrfoot <earwin@gmail.com>
> wrote:
>
>     On Thu, Apr 15, 2010 at 17:17, Yonik Seeley
>     <yonik@lucidimagination.com> wrote:
>     > Seamless online upgrades have their place too... say you are
>     > upgrading one server at a time in a cluster.
>
>     Nothing here that can't be solved with an upgrade tool. Down one
>     server, upgrade index, upgrade software, up.
>

Having read the thread, I have a few comments. Much of this is summary.

The current proposal requires a re-index on every Lucene upgrade. Plain 
and simple.

Robert is right about the analyzers.

There are three levels of backward compatibility, though we usually talk 
about only two.

First, the index format. IMHO, it is a good thing for a major release to 
be able to read the prior major release's index. And the ability to 
convert it to the current format via optimize is also good. Whatever is 
decided on this thread should take this seriously.

Second, the API. The current mechanism of using deprecations to migrate 
users to a new API is both a blessing and a curse. It is a blessing to 
end users because it gives them a clear migration path. It is a curse to 
development because the API is bloated with the old and the new. Further, 
it causes unfortunate class naming, with a tendency to migrate away 
from the good name. It is also a curse to end users because the doubled 
API can cause confusion.

While I like the mechanism of deprecations to migrate me from one 
release to another, I'd be open to another mechanism. A lot of effort is 
put into API bw compat, and that effort might be better spent elsewhere, 
e.g. on thorough documentation.

Third, the behavior. WRT Analyzers (consisting of tokenizers, stemmers, 
stop words, ...): if the token stream changes, the index is no longer 
valid. It may appear to work, but it is broken. The token stream applies 
not only to the indexed documents, but also to the user-supplied query. 
A simple example: if from one release to another 'a' is dropped from the 
stop word list, then phrase searches including 'a' won't work against an 
existing index, as 'a' was never put in that index. Even a simple, 
obvious bug fix that changes the stream is bad.
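
To make the stop-word case concrete, here is a minimal sketch in plain 
Java. It is a toy model, not Lucene code: the class StopWordDrift, its 
methods, and the two stop lists are all made up for illustration, and it 
ignores position increments and everything else a real analyzer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy model (NOT Lucene code) of a stop-word change breaking phrase search. */
final class StopWordDrift {

    /** Lowercases, splits on whitespace, and drops stop words. */
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!raw.isEmpty() && !stopWords.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    /** True if the query tokens occur as one contiguous run in the indexed tokens. */
    static boolean phraseMatches(List<String> indexed, List<String> query) {
        return !query.isEmpty() && Collections.indexOfSubList(indexed, query) >= 0;
    }

    public static void main(String[] args) {
        // Hypothetical stop lists: the "old release" treats 'a' as a stop word,
        // the "new release" no longer does.
        Set<String> oldStops = new HashSet<>(Arrays.asList("a", "the"));
        Set<String> newStops = new HashSet<>(Arrays.asList("the"));

        String doc = "She read a book";
        String phrase = "read a book";

        // Index built by the old release: 'a' never reaches the index.
        List<String> oldIndex = analyze(doc, oldStops);

        // Same analyzer on both sides: the phrase is found.
        System.out.println(phraseMatches(oldIndex, analyze(phrase, oldStops)));

        // New analyzer on the query, old index: the phrase is not found.
        System.out.println(phraseMatches(oldIndex, analyze(phrase, newStops)));

        // After re-indexing with the new analyzer the phrase is found again.
        System.out.println(phraseMatches(analyze(doc, newStops), analyze(phrase, newStops)));
    }
}
```

The middle case is the dangerous one: nothing fails loudly, the search 
just quietly stops finding text that is in the book.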

Another behavior change is an upgrade in Java version. When Lucene 3 
forced users to move to Java 5, the version of Unicode changed with it. 
This in itself causes a change in some token streams.

With a change to a token stream, the index must be re-created to ensure 
expected behavior. If the original input is no longer available or the 
index cannot be rebuilt for whatever reason, then Lucene should not be 
upgraded.

It is my observation, though possibly not correct, that core has only 
rudimentary analysis capabilities, handling English very well. To handle 
other languages well, "contrib/analyzers" is required. Until recently it 
did not get much love. There have been many bw compat breaking changes 
(though with Version one can probably get the prior behavior). IMHO, most 
of contrib/analyzers should be core. My guess is that most non-trivial 
applications will use contrib/analyzers.

The other problem I have is the assumption that re-index is feasible and 
that indexes are always server based. Re-index feasibility has already 
been well-discussed on this thread from a server-side perspective. There 
are many client-side applications, like mine, where the index is built 
and used on the client's computer. In my scenario the user builds indexes 
individually for books. From the index perspective, the sentence is the 
Lucene document and the book is the index. Building an index is 
voluntary and takes time proportional to the size of the book and 
inversely proportional to the power of the computer. Our user base is 
people with ancient, underpowered laptops in third-world countries. On 
those machines it might take 10 minutes to create an index, and during 
that time the machine is fairly unresponsive. There is no opportunity to 
"do it in the background."

So what are my choices? (rhetorical) With each new release of my app, 
I'd like to exploit the latest and greatest features of Lucene. And I'm 
going to change my app with features which may or may not be related to 
the use of Lucene. Those latter features are what matter the most to my 
user base. They don't care what technologies are used to do searches. If 
the latest Lucene jar does not let me use Version (or some other 
mechanism) to maintain compatibility with an older index, the user will 
have to re-index. Or I can forgo any future upgrades of Lucene. 
Neither is very palatable.
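
What I mean by "Version (or some other mechanism)" is roughly the 
following, sketched in plain Java. Everything here is hypothetical and 
none of it is Lucene API: the AnalysisVersion enum, the 
"analysis.version" property key, and the stop lists are stand-ins. The 
idea is only that an app records which analysis behavior built each 
index and keeps asking for that behavior until the user chooses to 
re-index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;

/** Toy sketch (NOT Lucene API) of pinning analysis behavior per index. */
final class PinnedAnalysis {

    /** Hypothetical behavior versions; V1 is the oldest. */
    enum AnalysisVersion { V1, V2 }

    /** Hypothetical key under which an app stores the version next to its index. */
    static final String KEY = "analysis.version";

    /** Returns the stop list that the given version used (made-up lists). */
    static Set<String> stopWordsFor(AnalysisVersion version) {
        switch (version) {
            case V1:
                return set("a", "the");
            default:
                return set("the");
        }
    }

    /** Records, at index-build time, which behavior produced the index. */
    static void recordVersion(Properties indexMeta, AnalysisVersion version) {
        indexMeta.setProperty(KEY, version.name());
    }

    /** Reads the recorded version; an index with no record is assumed to be V1. */
    static AnalysisVersion versionOf(Properties indexMeta) {
        return AnalysisVersion.valueOf(
                indexMeta.getProperty(KEY, AnalysisVersion.V1.name()));
    }

    private static Set<String> set(String... words) {
        return Collections.unmodifiableSet(new HashSet<>(Arrays.asList(words)));
    }

    public static void main(String[] args) {
        // An index built before the app started recording versions.
        Properties oldBookIndex = new Properties();

        // An index the user rebuilt after upgrading.
        Properties rebuiltBookIndex = new Properties();
        recordVersion(rebuiltBookIndex, AnalysisVersion.V2);

        // Each index is queried with the stop list it was built with.
        System.out.println(stopWordsFor(versionOf(oldBookIndex)));
        System.out.println(stopWordsFor(versionOf(rebuiltBookIndex)));
    }
}
```

With something like this, the new jar can ship alongside old indexes, 
and re-indexing a book becomes the user's choice instead of a condition 
of the upgrade.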

-- DM Smith