From: "Michael McCandless (JIRA)" <java-dev@lucene.apache.org>
To: java-dev@lucene.apache.org
Date: Fri, 2 Mar 2007 07:53:51 -0800 (PST)
Subject: [jira] Updated: (LUCENE-710) Implement "point in time" searching without relying on filesystem semantics

    [ https://issues.apache.org/jira/browse/LUCENE-710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-710:
--------------------------------------

    Attachment: LUCENE-710.patch

OK,
I've attached a patch to implement "commit on close" and "custom deletion policies". The design is exactly what's described above. There are no changes to the file format. All tests pass, and I've added additional tests for this new functionality.

Summary of the external changes:

* For "commit on close":

  - Added new IndexWriter constructors that take an "autoCommit" boolean: if it's false, then readers will not see any actions done by this writer (no new segments_N is written) until writer.close() is called.

  - Added IndexWriter.abort(), which closes the writer without committing, cleaning up any temp files it had added to the index.

* For "custom deletion policies":

  - Created an IndexDeletionPolicy interface and added constructors to IndexReader/IndexWriter allowing you to specify a deletion policy.

  - Created an IndexCommitPoint interface: this is passed to the deletion policy to represent each commit. The policy calls the delete method on this interface to remove a commit.

  - Created one deletion policy (KeepOnlyLastCommitDeletionPolicy) and made that the default policy. (The unit test for this has other "interesting" policies, like "delete by age since this commit was obsoleted", initially discussed on java-dev.)

Summary of internal changes:

* Created a files() method in SegmentInfo (and changed SegmentReader.files() to use it).

* Changed IndexFileDeleter to use reference counting to keep track of which files are deletable because no commit(s) (nor the in-memory SegmentInfos) reference them. This is a nice simplification of IndexFileDeleter: previously it had detailed knowledge about which files, extensions, etc., to look for and delete. Now it has far less of that because it relies entirely on SegmentInfo.files() to compute that.

* Changed IndexReader/IndexWriter to not directly delete files and instead notify IndexFileDeleter when there has been a change to the in-memory SegmentInfos. The deleter then increfs/decrefs to determine which files can safely be deleted.
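The reference-counting idea can be sketched in plain Java, independent of the actual IndexFileDeleter code. The class and method names below are illustrative, not the patch's real API: each commit point increfs every file it references, deleting a commit decrefs those files, and a file whose count reaches zero is referenced by no commit and is safe to remove.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Lucene's IndexFileDeleter) of reference-counted
// tracking of index files across commit points.
class RefCountingDeleter {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private final List<String> deleted = new ArrayList<>();

    // A new commit point references these files: bump each count.
    void incRef(Collection<String> files) {
        for (String f : files) {
            refCounts.merge(f, 1, Integer::sum);
        }
    }

    // A commit point was removed by the policy: drop each count, and
    // delete any file no commit references any more.
    void decRef(Collection<String> files) {
        for (String f : files) {
            int count = refCounts.merge(f, -1, Integer::sum);
            if (count == 0) {
                refCounts.remove(f);
                deleted.add(f);   // real code would delete it from the Directory
            }
        }
    }

    List<String> deletedFiles() {
        return deleted;
    }

    public static void main(String[] args) {
        RefCountingDeleter deleter = new RefCountingDeleter();
        deleter.incRef(Arrays.asList("_1.cfs", "segments_1"));            // commit 1
        deleter.incRef(Arrays.asList("_1.cfs", "_2.cfs", "segments_2"));  // commit 2
        deleter.decRef(Arrays.asList("_1.cfs", "segments_1"));            // policy drops commit 1
        // Only segments_1 becomes deletable; _1.cfs is still held by commit 2.
        System.out.println(deleter.deletedFiles());  // prints "[segments_1]"
    }
}
```

This is why the deleter no longer needs per-extension knowledge: it only needs each commit's file list (via SegmentInfo.files()) and the counters do the rest.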
This is also a nice simplification for the same reason as above: now the writers just make changes to SegmentInfo(s) without having to compute/track the consequences to specific index files.

* Simplified the fix for LUCENE-702 (addIndexes corrupts index on disk full) to just temporarily set autoCommit=false if it's not already.

* Added get/setDefaultInfoStream to IndexWriter so you can see things that happen during the IndexWriter constructor.

* No longer store/propagate a persistent IndexFileDeleter inside IndexReader (removed the protected get/setDeleter methods). This is a nice simplification because the deleter is now only needed briefly during commit().

* Reworked the top-level javadoc for IndexWriter.

* Added try/finally to remove a partially written segments_N if we hit an IOException while trying to write it.

* Other small changes (small refactorings, fixes to javadocs, fixed spelling, etc.).

> Implement "point in time" searching without relying on filesystem semantics
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-710
>                 URL: https://issues.apache.org/jira/browse/LUCENE-710
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.1
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>        Attachments: LUCENE-710.patch
>
>
> This was touched on in recent discussion on the dev list:
> http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700
> and then more recently on the user list:
> http://www.gossamer-threads.com/lists/lucene/java-user/42088
>
> Lucene's "point in time" searching currently relies on how the
> underlying storage handles deletion of files that are held open for
> reading.
>
> This is highly variable across filesystems. For example, UNIX-like
> filesystems usually do "delete on last close", and Windows filesystems
> typically refuse to delete a file open for reading (so Lucene retries
> later).
But NFS just removes the file out from under the reader, and
> for that reason "point in time" searching doesn't work on NFS
> (see LUCENE-673).
>
> With the lockless commits changes (LUCENE-701), it's quite simple to
> re-implement "point in time" searching so as to not rely on filesystem
> semantics: we can just keep more than the last segments_N file (as
> well as all files they reference).
>
> This is also in keeping with the design goal of "rely on as little as
> possible from the filesystem". E.g., with lockless commits we no longer
> re-use filenames (so we don't rely on the filesystem cache being
> coherent) and we no longer use file renaming (because on Windows it can
> fail). This would be another step of not relying on the semantics of
> "deleting open files". The less we require from the filesystem, the
> more portable Lucene will be!
>
> Where it gets interesting is what "policy" we would then use for
> removing segments_N files. The policy now is "remove all but the last
> one". I think we would keep this policy as the default. Then you
> could imagine other policies:
>
> * Keep the past N days' worth
> * Keep the last N
> * Keep only those in active use by a reader somewhere (note: it's
> tricky to reliably figure this out when readers have crashed, etc.)
> * Keep those "marked" as rollback points by some transaction, or
> marked explicitly as a "snapshot".
> * Or, roll your own: the "policy" would be an interface or abstract
> class and you could make your own implementation.
>
> I think for this issue we could just create the framework
> (an interface/abstract class for the "policy", invoked from
> IndexFileDeleter) and then implement the current policy (delete all
> but the most recent segments_N) as the default policy.
>
> In separate issue(s) we could then create the above more interesting
> policies.
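As a concrete illustration of the "keep the last N" and "roll your own" policies listed in the quoted description above, here is a hedged sketch. The patch attached to this issue names the real interfaces IndexDeletionPolicy and IndexCommitPoint and says only that the policy calls a delete method on each commit; the CommitPoint shape, the onCommit callback, and the KeepLastNPolicy name below are all illustrative assumptions, not the actual Lucene API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the commit-point handle a policy receives;
// the only capability the issue describes is "delete this commit".
interface CommitPoint {
    void delete();
}

// "Keep the last N" policy: given commit points ordered oldest-to-newest,
// delete all but the newest n.
class KeepLastNPolicy {
    private final int n;

    KeepLastNPolicy(int n) {
        this.n = n;
    }

    void onCommit(List<? extends CommitPoint> commits) {
        for (int i = 0; i < commits.size() - n; i++) {
            commits.get(i).delete();   // only the oldest commits are removed
        }
    }
}

class KeepLastNPolicyDemo {
    public static void main(String[] args) {
        // Five commits, tracked by a parallel "still alive" list.
        List<Boolean> alive = new ArrayList<>(Collections.nCopies(5, true));
        List<CommitPoint> commits = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int idx = i;
            commits.add(() -> alive.set(idx, false));
        }
        new KeepLastNPolicy(2).onCommit(commits);
        System.out.println(alive);  // prints "[false, false, false, true, true]"
    }
}
```

The same shape accommodates the other listed policies (keep the past N days' worth, keep marked snapshots): only the predicate deciding which commits to delete changes.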
> I think there are some important advantages to doing this:
>
> * "Point in time" searching would work on NFS (it doesn't now
> because NFS doesn't do "delete on last close"; see LUCENE-673)
> and on any other Directory implementations that don't work
> currently.
>
> * Transactional semantics become a possibility: you can set a
> snapshot, do a bunch of stuff to your index, and then roll back to
> the snapshot at a later time.
>
> * If a reader crashes or the machine gets rebooted, etc., it could
> choose to re-open the snapshot it had previously been using, whereas
> now the reader must always switch to the last commit point.
>
> * Searchers could search the same snapshot for follow-on actions.
> Meaning, the user does a search, then next page, drill down (Solr),
> drill up, etc. These are each separate trips to the server, and if the
> searcher has been re-opened, the user can get inconsistent results (=
> lost trust). But with this, one series of search interactions could
> explicitly stay on the snapshot it had started with.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org