Message-ID: <4CF427BC.8090906@gmail.com>
Date: Mon, 29 Nov 2010 17:22:52 -0500
From: DM Smith
To: dev@lucene.apache.org
Subject: Re: deprecating Versions
References: <803582DC-078C-4DC6-BEFC-F66376E90959@gmail.com> <4CF3E827.7080607@gmail.com>

On 11/29/2010 03:43 PM, Earwin Burrfoot wrote:
> On Mon, Nov 29, 2010 at 20:51, DM Smith wrote:
>> The other thing I'd like is for the spec to be saved alongside the index
>> as a manifest. From earlier threads, I can see that there might need to be
>> one for writing and another for reading. I'm not interested in using it to
>> construct an analyzer, but to determine whether the index is invalid wrt
>> the analyzer currently in use.
> You can already implement such behaviour with 3.x branch of Lucene.
> It has IW.commit(Map userdata) method, that allows you
> to commit with arbitrary payload, that binds to segment and can be
> read back later.

Cool. I forgot entirely about that.

>> I think there is a problem with deprecating and removing constants too.
>> In trunk, which will be 4.0, it needs to be able to read and/or upgrade 2.x
>> indexes. From an analyzer perspective, an index is invalid if the analyzer
>> would produce a different token stream for the same input. If the 2.x
>> version constants are gone, then the index built with 2.x version
>> constants is no longer valid. (It might be valid, but how can one have any
>> confidence of that?) Upgrading the index to the new internal format
>> cannot change this. A buggy lowercase Turkish word will still be buggy
>> after upgrade. (This is a 3.0 version constant that in 5.0 will still
>> need to be around.)
> I think it was declared that Lucene does not provide index
> compatibility across more than a single major revision.
> Thus, we don't guarantee reading 2.x index with 4.0 Lucene. So, we can
> drop 2.x constants and compatibility.
> But we still have to support 3.x. In version 5.0 then we're dropping
> 3.x constants and support for bugs/deprecated
> features of 3.x.

Yes, you are correct that 4.0 may read 2.x indexes but is not guaranteed
to. My bad, yet again. I went back to the threads regarding this around
May 25, and it was also decided that 4.x might not be able to read 3.x,
but will provide a migration tool in such a case.

That said, my point still stands. The 3.0 version constant, which is used
by an analyzer to preserve 3.0 behavior, will need to be retained for the
sake of analyzers in 5.0. Otherwise the index will need to be rebuilt from
the original input. (I'm referencing 3.0 rather than 2.x because of the
example I have in mind.)

A 3.0 index that is migrated to a 4.0 index still has tokens produced by
an analyzer that was buggy. Example: a Turkish index with the wrong lower
case i. Prior to LUCENE-2101, the "regular" upper case I would lowercase
to i. After: İ (dotted capital I) => i ("regular" lower case i) and
I ("regular" upper case I) => ı (dotless lower case i). This very
commonly occurs in Turkish text.

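To make the mismatch concrete, here is a minimal sketch. It is not Lucene
code: it only uses the JDK's String.toLowerCase with two different locales
as a stand-in for the old and the corrected analyzer behavior. The class
name, the method names and the sample word are made up for illustration.

```java
import java.util.Locale;

public class TurkishLowercaseDemo {
    // Hypothetical stand-in for the pre-fix analyzer: locale-independent lowercasing.
    public static String lowerLocaleIndependent(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Hypothetical stand-in for the corrected analyzer: Turkish casing rules.
    public static String lowerTurkish(String token) {
        return token.toLowerCase(Locale.forLanguageTag("tr"));
    }

    public static void main(String[] args) {
        // Sample word starting with the "regular" upper case I.
        String word = "IRMAK";

        // Token as the old index would hold it.
        String indexedToken = lowerLocaleIndependent(word);
        // Token as a corrected analyzer would produce it at query time.
        String queryToken = lowerTurkish(word);

        System.out.println("indexed: " + indexedToken);
        System.out.println("query:   " + queryToken);
        // The two tokens differ in their first character, so an exact
        // term lookup with the query token misses the indexed token.
        System.out.println("match:   " + indexedToken.equals(queryToken));
    }
}
```

The point is only that the same input yields two different tokens, so a
term written under one rule cannot be found by a query analyzed under the
other.
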
So the 4.0 index, still using the 3.0 version constant to get the expected
behavior, works as it always did. Now in 5.0, there might be a migration
tool, or it will be able to read a 4.x index. If the 3.0 constant is gone,
none of these tokens are reachable. Search requests will have the correct
lower case i and will not be able to find those with the wrong one. It
will be very obvious.

Regarding this analyzer, code that uses a 2.x version constant for this
analyzer will need to change to a 3.0 version constant in order for the
index to be usable in the 4.x series if the 2.x constants are removed.

I don't think this is an isolated example. With what's happening, everyone
whose index uses a deprecated version constant will have one very long
major release cycle in which to rebuild it from scratch.

And as I said at the bottom of my last email, I'm going to re-index
because I am able and because I want correct behavior. So whatever is
decided won't affect my application of Lucene.

-- DM