From: Ryan Ernst
Date: Fri, 1 Aug 2014 16:47:12 -0700
Subject: Lucene versioning logic
To: dev@lucene.apache.org

There has been a lot of heated discussion recently about version
tracking in Lucene [1] [2]. I wanted to have a fresh discussion outside
of JIRA to give a full description of the current state of things, the
problems I have heard, and a proposed solution.

CURRENT

We have 2 pieces of code that handle “versioning.” The first is
Constants.LUCENE_MAIN_VERSION, which is written to the SegmentInfo for
each segment. This is a string version which is used to detect when the
current version of Lucene is newer than the version that wrote the
segment (and how/if an upgrade to a newer codec should be done). There
is some complication with the “display” version and non-display
version, which are distinguished by whether the version of Lucene was
an official release or an alpha/beta version (a distinction added
specifically for the 4.0.0 release ramp-up). This string version also
has its own parsing and comparison methods.

The second piece of versioning code is in Version.java, which is an
enum used by analyzers to maintain backwards-compatible behavior given
a specific version of Lucene. The enum only contains values for dot
releases of Lucene, not bug fix releases (which is what spurred the
recent discussions over version). Analyzers’ constructors take a
required Version parameter, which is only actually used by the few
analyzers that have changed behavior recently.
Version.java also contains its own, separate version parsing and
comparison methods.

CONCERNS

* Having 2 different pieces of code that do very similar things is
  confusing for development. Very few developers appear to really
  understand the current system (especially the alpha/beta setup).
* Users are generally confused by the Version passed to analyzers: I
  know I was when I first started working with Lucene, and
  Version.LUCENE_CURRENT was deprecated because users used it without
  understanding the implications.
* Most analyzers currently have dead code constructors, since they
  never make use of Version. There are also a lot of classes used by
  analyzers which contain similar dead code.
* Backwards compatibility needs to be handled in some fashion, to
  ensure users have a path to upgrade from one version of Lucene to
  another without requiring immediate re-indexing.

PROPOSAL

I propose the following:

* Consolidate all version-related enumeration, including reading and
  writing string versions, into Version.java. Have a static method
  that returns the current Lucene version (replacing
  Constants.LUCENE_MAIN_VERSION).
* Make bug fix releases first class in the enumeration, so that they
  can be distinguished for any compatibility issues that come up.
* Remove all snapshot/alpha/beta versioning logic. Alpha/beta was
  really only necessary for 4.0 because of the extreme changes that
  were being made. The system is much more stable now, and 5.0 should
  not require preview releases, IMO. I don’t think snapshots should be
  a concern because any user building an index from an unreleased
  build (which they built themselves) is just asking for trouble. They
  do so at their own risk (of figuring out how to upgrade their
  indexes if they are not trash-able). Backwards compatibility can be
  handled by adding the alpha/beta/final versions of 4.0 to the enum
  (and special parsing logic for this).
  If Lucene changes so much that we need alpha/beta type
  discrimination in the future, we can revisit the system if simply
  having extra versions in the enum won’t work.
* Analyzer constructors should have Version removed, and a setter
  should be added which allows production users to set the version
  used. This way any analyzer can still use the version if it is set
  to something other than current (which would be the default), but
  users simply prototyping do not need to worry about it.
* Classes that analyzers use, which take Version, should have Version
  removed, and the analyzers should choose which settings/variants of
  those classes to use based on the version they have set. In other
  words, all version variant logic should be contained within the
  analyzers. For example, an analyzer could pick
  Lucene47WordDelimiterFilter, or StandardAnalyzer could take the
  Unicode version. Factories could still take Version (e.g.
  TokenizerFactory, TokenFilterFactory, etc.) to produce the correct
  component (so nothing will change for Solr in this regard).

I’m sure not everyone will be happy with what I have proposed, but I’m
hoping we can work out a solution together, and then implement it in a
team-like fashion, the way I have seen the community work in the past,
and I hope to see again in the future.

Thanks
Ryan

[1] https://issues.apache.org/jira/browse/LUCENE-5850
[2] https://issues.apache.org/jira/browse/LUCENE-5859
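
For illustration only, the PROPOSAL bullets above could be sketched
roughly as follows. This is not Lucene code: LuceneVersioningSketch,
SketchVersion, SketchAnalyzer, the "-ALPHA"/"-BETA" string forms, the
listed releases, and the 4.9.0 cutoff are all assumptions made up for
the example. It shows one enum that owns ordering, reading and writing
of string versions, and a static "current" accessor, plus an
analyzer-like class that takes no version in its constructor and
exposes a setter instead.

```java
// Hypothetical sketch; none of these names are Lucene's actual API.
final class LuceneVersioningSketch {

    // Declaration order doubles as release order, so Enum.compareTo
    // answers "older or newer" without extra comparison code.
    enum SketchVersion {
        // 4.0 pre-releases kept only so old indexes can be recognized.
        V4_0_0_ALPHA(4, 0, 0, "ALPHA"),
        V4_0_0_BETA(4, 0, 0, "BETA"),
        V4_0_0(4, 0, 0, ""),
        V4_9_0(4, 9, 0, ""),
        V4_9_1(4, 9, 1, ""), // a bug fix release as a first-class value
        V5_0_0(5, 0, 0, "");

        private final int major;
        private final int minor;
        private final int bugfix;
        private final String qualifier;

        SketchVersion(int major, int minor, int bugfix, String qualifier) {
            this.major = major;
            this.minor = minor;
            this.bugfix = bugfix;
            this.qualifier = qualifier;
        }

        // Stands in for a separate "main version" string constant:
        // the last declared value is treated as this build's version.
        static SketchVersion current() {
            SketchVersion[] all = values();
            return all[all.length - 1];
        }

        boolean onOrAfter(SketchVersion other) {
            return compareTo(other) >= 0;
        }

        // String form that would be written with a segment (assumed format).
        @Override
        public String toString() {
            String base = major + "." + minor + "." + bugfix;
            return qualifier.isEmpty() ? base : base + "-" + qualifier;
        }

        // Reads the string form back; unknown versions are rejected.
        static SketchVersion parse(String text) {
            for (SketchVersion v : values()) {
                if (v.toString().equals(text)) {
                    return v;
                }
            }
            throw new IllegalArgumentException("Unknown version: " + text);
        }
    }

    // Analyzer stand-in: defaults to the current version, with an
    // optional setter for users who need older behavior.
    static final class SketchAnalyzer {
        private SketchVersion version = SketchVersion.current();

        void setVersion(SketchVersion version) {
            this.version = version;
        }

        // All version-variant logic lives in the analyzer; the 4.9.0
        // cutoff is arbitrary and only for illustration.
        String filterVariant() {
            return version.onOrAfter(SketchVersion.V4_9_0) ? "current" : "legacy";
        }
    }

    public static void main(String[] args) {
        SketchAnalyzer analyzer = new SketchAnalyzer();
        System.out.println(SketchVersion.current() + " " + analyzer.filterVariant());
        analyzer.setVersion(SketchVersion.parse("4.0.0-BETA"));
        System.out.println(analyzer.filterVariant());
    }
}
```

Relying on declaration order keeps comparison trivial, but it means
every new release has to be inserted in the right position in the
enum, which is one of the trade-offs such a design would need to
weigh.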