From: "Karthick Sankarachary (JIRA)"
To: dev@lucene.apache.org
Date: Sat, 1 May 2010 17:57:58 -0400 (EDT)
Subject: [jira] Commented: (LUCENE-2425) An Anti-Merging Multi-Directory Indexing Framework
Message-ID: <9872383.2031272751078214.JavaMail.jira@thor>
In-Reply-To: <24477412.1901272749157250.JavaMail.jira@thor>


    [ https://issues.apache.org/jira/browse/LUCENE-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863062#action_12863062 ]

Karthick Sankarachary commented on LUCENE-2425:
-----------------------------------------------

In this comment, we outline all split policies that have been implemented or are in the process of being implemented. We hope this both validates the framework and serves as a reference point for future work.

The split policies currently under development include:

1) A rotating split policy, which is essentially a time-bound index, where each sub-index denotes a (contiguous) time range, and there's a cap on the number of sub-indices.
2) An archiving split policy, which builds on the rotating split policy, where older sub-indices (that have been rotated out) are kept around for a while before being removed.
3) A real-time split policy, which overcomes the near-real-time limitation of current indices. It does so by maintaining a cache for each reader obtained for that index.
4) A caching split policy, which builds on the real-time split policy, where writes (and other updates) to the index are buffered in memory until a flush is requested.
5) A mirroring split policy, which treats each sub-directory as a mirror image of the super-directory.
6) A sharding split policy, which treats each sub-directory as a shard (or slice) of the super-directory.

> An Anti-Merging Multi-Directory Indexing Framework
> --------------------------------------------------
>
>                 Key: LUCENE-2425
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2425
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*, Index
>    Affects Versions: 3.0.1
>            Reporter: Karthick Sankarachary
>         Attachments: LUCENE-2425.patch
>
>
> By design, a Lucene index tends to merge documents that span multiple segments into fewer segments, in order to optimize its directory structure, which in turn leads to better search performance. In particular, it relies on a merge policy to specify the set of merge operations that should be performed when the index is optimized.
> Often, there's a need to do the exact opposite, which is to "split" the documents. This calls for a mechanism that facilitates sub-division of documents based on a certain (ideally, user-defined) algorithm. By way of example, one may wish to sub-divide (or partition) documents based on parameters such as time, space, real-timeliness, and so on. Herein, we describe an indexing framework that builds on the Lucene index writer and reader to address use cases wherein documents need to diverge rather than converge.
> In brief, it associates zero or more sub-directories with the index's directory, which serve to complement it in some manner. The sub-directories (a.k.a. splits) are managed by a split policy, which is notified of all changes made to the index directory (a.k.a. super-directory), thus allowing it to modify its sub-directories as it sees fit. To make the index reader and writer "observable", we extend Lucene's reader and writer with the goal of providing hooks into every method that could potentially change the index. This allows such changes to be propagated to the split policy, which essentially acts as a listener on the index.
> We refer to each sub-directory (or split) and the super-directory as a sub-index of the containing index (a.k.a. the split index). Note that a sub-directory need not be co-located with the super-directory. Furthermore, the split policy in turn relies on one or more split rules to determine when to add or remove sub-directories. This allows for a clear separation of the event that triggers a split from the management of those splits.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
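
The rotating split policy and the listener relationship described above can be illustrated with a small sketch. This is not the code in LUCENE-2425.patch and it uses no Lucene classes. The names SplitPolicy, SubIndex, RotatingSplitPolicy and onDocumentAdded are hypothetical, chosen only for illustration. It assumes fixed-width time windows and roughly ordered timestamps, and it models a sub-index as an in-memory list rather than a directory.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Illustrative sketch only; all names here are hypothetical, not Lucene API. */
class RotatingSplitSketch {

    /** Hypothetical listener notified when a document is added to the super-directory. */
    interface SplitPolicy {
        void onDocumentAdded(long timestampMillis, String doc);
    }

    /** Stand-in for a sub-index covering the time range [start, start + window). */
    static final class SubIndex {
        final long start;
        final List<String> docs = new ArrayList<>();

        SubIndex(long start) {
            this.start = start;
        }
    }

    /** Keeps at most maxSplits sub-indices, each covering windowMillis of time. */
    static final class RotatingSplitPolicy implements SplitPolicy {
        private final long windowMillis;
        private final int maxSplits;
        private final Deque<SubIndex> splits = new ArrayDeque<>();

        RotatingSplitPolicy(long windowMillis, int maxSplits) {
            if (windowMillis <= 0 || maxSplits <= 0) {
                throw new IllegalArgumentException("window and cap must be positive");
            }
            this.windowMillis = windowMillis;
            this.maxSplits = maxSplits;
        }

        @Override
        public void onDocumentAdded(long timestampMillis, String doc) {
            // Align the timestamp to the start of its time window.
            long start = Math.floorDiv(timestampMillis, windowMillis) * windowMillis;
            SubIndex newest = splits.peekLast();
            if (newest == null || start > newest.start) {
                // A new time range begins: open a split and rotate out the oldest ones.
                newest = new SubIndex(start);
                splits.addLast(newest);
                while (splits.size() > maxSplits) {
                    splits.removeFirst();
                }
            }
            // Simplifying assumption: a late (out-of-order) document joins the newest split.
            newest.docs.add(doc);
        }

        int splitCount() {
            return splits.size();
        }

        /** Start times of the retained splits, oldest first. */
        List<Long> splitStarts() {
            List<Long> starts = new ArrayList<>();
            for (SubIndex s : splits) {
                starts.add(s.start);
            }
            return starts;
        }
    }

    public static void main(String[] args) {
        RotatingSplitPolicy policy = new RotatingSplitPolicy(1000L, 2);
        for (long t : new long[] {0L, 500L, 1000L, 2500L, 3100L}) {
            policy.onDocumentAdded(t, "doc@" + t);
        }
        System.out.println(policy.splitStarts());
    }
}
```

With a one-second window and a cap of two, the five documents above open four time ranges in turn, and only the two most recent are retained. An archiving policy, as outlined in the comment, would presumably move the rotated-out splits elsewhere instead of discarding them in the removeFirst() step.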