From: Ryan Ernst <ryan@iernst.net>
Date: Fri, 1 Aug 2014 16:47:12 -0700
Message-ID: 
 <CA+DiXd4-fwKpybSdxw_d4NPan+sQuqj6ykLmrqj83gn=b5yhaA@mail.gmail.com>
Subject: Lucene versioning logic
To: dev@lucene.apache.org

There has been a lot of heated discussion recently about version
tracking in Lucene [1] [2].  I wanted to have a fresh discussion
outside of jira to give a full description of the current state of
things, the problems I have heard, and a proposed solution.

CURRENT

We have 2 pieces of code that handle “versioning.”  The first is
Constants.LUCENE_MAIN_VERSION, which is written to the SegmentsInfo
for each segment.  This is a string version which is used to detect
when the current version of lucene is newer than the version that
wrote the segment (and how/if an upgrade to a newer codec should be
done). There is some complication with the “display” version and
non-display version, which are distinguished by whether the version of
lucene was an official release, or an alpha/beta version (which was
added specifically for the 4.0.0 release ramp up).  This string
version also has its own parsing and comparison methods.
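
To illustrate the kind of logic this involves, here is a minimal sketch
of comparing dotted version strings.  It is not the actual Lucene code:
the class name VersionStringComparator is hypothetical, and the
alpha/beta and display/non-display handling described above is left out
entirely.

```java
import java.util.Comparator;

// Hypothetical helper for illustration only; not Lucene's implementation.
final class VersionStringComparator implements Comparator<String> {
    @Override
    public int compare(String a, String b) {
        String[] as = a.split("\\.");
        String[] bs = b.split("\\.");
        int n = Math.max(as.length, bs.length);
        for (int i = 0; i < n; i++) {
            // A missing component counts as 0, so "4.9" equals "4.9.0".
            int ai = i < as.length ? Integer.parseInt(as[i]) : 0;
            int bi = i < bs.length ? Integer.parseInt(bs[i]) : 0;
            if (ai != bi) {
                // Compare numerically, so "4.10" sorts after "4.9".
                return Integer.compare(ai, bi);
            }
        }
        return 0;
    }
}
```

Even this simplified form has to treat components numerically rather
than lexically, which is part of why keeping two copies of such logic
is a burden.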

The second piece of versioning code is in Version.java, which is an
enum used by analyzers to maintain backwards compatible behavior given
a specific version of lucene.  The enum only contains values for dot
releases of lucene, not bug fixes (which was what spurred the recent
discussions over version). Analyzers’ constructors take a required
Version parameter, which is only actually used by the few analyzers
that have changed behavior recently.  Version.java contains its own
separate version parsing and comparison methods.


CONCERNS

* Having 2 different pieces of code that do very similar things is
confusing for development.  Very few developers appear to really
understand the current system (especially when trying to understand
the alpha/beta setup).

* Users are generally confused by the Version passed to analyzers: I
know I was when I first started working with Lucene, and
Version.CURRENT_VERSION was deprecated because users used that without
understanding the implications.

* Most analyzers currently have dead code constructors, since they
never make use of Version.  There are also a lot of classes used by
analyzers which contain similar dead code.

* Backwards compatibility needs to be handled in some fashion, to
ensure users have a path to upgrade from one version of lucene to
another, without requiring immediate re-indexing.


PROPOSAL

I propose the following:

* Consolidate all version related enumeration, including reading and
writing string versions, into Version.java.  Have a static method that
returns the current lucene version (replacing
Constants.LUCENE_MAIN_VERSION).

* Make bug fix releases first class in the enumeration, so that they
can be distinguished for any compatibility issues that come up.

* Remove all snapshot/alpha/beta versioning logic.  Alpha/beta was
really only necessary for 4.0 because of the extreme changes that were
being made.  The system is much more stable now, and 5.0 should not
require preview releases, IMO.  I don’t think snapshots should be a
concern because any user building an index from an unreleased build
(which they built themselves) is just asking for trouble.  They do so
at their own risk (of figuring out how to upgrade their indexes if
they are not trash-able).  Backwards compatibility can be handled by
adding the alpha/beta/final versions of 4.0 to the enum (and special
parsing logic for this).  If lucene changes so much that we need
alpha/beta type discrimination in the future, we can revisit the
system if simply having extra versions in the enum won't work.

* Analyzer constructors should have Version removed, and a setter
should be added which allows production users to set the version used.
This way any analyzers can still use version if it is set to something
other than current (which would be the default), but users simply
prototyping do not need to worry about it.

* Classes that analyzers use, which take Version, should have Version
removed, and the analyzers should choose which settings/variants of
those classes to use based on the version they have set. In other
words, all version variant logic should be contained within the
analyzers.  For example, there could be a Lucene47WordDelimiterFilter,
or StandardAnalyzer could take the unicode version.  Factories could
still take Version (e.g. TokenizerFactory, TokenFilterFactory, etc) to
produce the correct component (so nothing will change for solr in this
regard).
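
As a rough sketch of how the points above could fit together (this is
an illustration under assumptions, not a patch): SketchVersion and
SketchAnalyzer are hypothetical names, the value returned by current()
is a placeholder, and the 4.8.0 cutoff in the analyzer is assumed
purely for the example.  The onOrAfter name mirrors the method that
already exists on Version.

```java
// Hypothetical consolidated version class; not taken from the Lucene codebase.
final class SketchVersion implements Comparable<SketchVersion> {
    final int major;
    final int minor;
    final int bugfix; // bug fix releases are first class

    SketchVersion(int major, int minor, int bugfix) {
        this.major = major;
        this.minor = minor;
        this.bugfix = bugfix;
    }

    // Stands in for Constants.LUCENE_MAIN_VERSION; the value is a placeholder.
    static SketchVersion current() {
        return new SketchVersion(4, 10, 0);
    }

    // Parses "major.minor" or "major.minor.bugfix"; a missing bugfix is 0.
    static SketchVersion parse(String s) {
        String[] parts = s.trim().split("\\.");
        if (parts.length < 2 || parts.length > 3) {
            throw new IllegalArgumentException("Unparseable version: " + s);
        }
        int bugfix = parts.length == 3 ? Integer.parseInt(parts[2]) : 0;
        return new SketchVersion(
            Integer.parseInt(parts[0]), Integer.parseInt(parts[1]), bugfix);
    }

    boolean onOrAfter(SketchVersion other) {
        return compareTo(other) >= 0;
    }

    @Override
    public int compareTo(SketchVersion o) {
        if (major != o.major) return Integer.compare(major, o.major);
        if (minor != o.minor) return Integer.compare(minor, o.minor);
        return Integer.compare(bugfix, o.bugfix);
    }

    @Override
    public String toString() {
        return major + "." + minor + "." + bugfix;
    }
}

// Hypothetical analyzer: no Version in the constructor, an optional setter instead.
final class SketchAnalyzer {
    // Defaults to the current version, so prototyping users can ignore it.
    private SketchVersion version = SketchVersion.current();

    void setVersion(SketchVersion version) {
        this.version = version;
    }

    // All version-variant logic stays inside the analyzer.
    // The 4.8.0 cutoff is an assumption made for this example.
    String wordDelimiterVariant() {
        return version.onOrAfter(new SketchVersion(4, 8, 0)) ? "current" : "legacy";
    }
}
```

A production user with an existing index would call something like
analyzer.setVersion(SketchVersion.parse("4.7.2")) once, while everyone
else gets the current behavior without touching Version at all.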

I’m sure not everyone will be happy with what I have proposed, but I’m
hoping we can work out a solution together, and then implement in a
team-like fashion, the way I have seen the community work in the past,
and I hope to see again in the future.

Thanks
Ryan

[1] https://issues.apache.org/jira/browse/LUCENE-5850
[2] https://issues.apache.org/jira/browse/LUCENE-5859
