Message-ID: <1192302391.1227573344308.JavaMail.jira@brutus>
Date: Mon, 24 Nov 2008 16:35:44 -0800 (PST)
From: "Marvin Humphrey (JIRA)"
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1458) Further steps towards flexible indexing
In-Reply-To: <1307591248.1227004424235.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/LUCENE-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650412#action_12650412 ]

Marvin Humphrey commented on LUCENE-1458:
-----------------------------------------

> Be careful: it's the seeking that kills you (until we switch to SSDs
> at which point perhaps most of this discussion is moot!). Even though
> the terms index net size is low, if re-heating the spots you touch
> incurs 20 separate page misses, you lose.

Perhaps for such situations, we can make it possible to create custom
HotLexiconReader or HotIndexReader subclasses that slurp term index files
and what-have-you into process memory. Implementation would be easy, since
we can just back the InStreams with malloc'd RAM buffers rather than memory
mapped system buffers.

Consider the tradeoffs. On the one hand, if we rely on memory mapped
buffers, busy systems may experience sluggish search after long lapses in a
worst case scenario. On the other hand, reading a bunch of stuff into
process memory makes IndexReader a lot heavier, with large indexes imposing
consistently sluggish startup and a large RAM footprint on each object.

> It seems like the ability to very quickly launch brand new searchers
> is/has become a strong design goal of Lucy/KS. What's the driver
> here? Is it for near-realtime search?

Near-realtime search is one of the motivations. But lightweight
IndexReaders are more convenient in all sorts of ways.

Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever
indexes get modified underneath a persistent search service. This is
certainly a problem when you are trying to keep up with real-time
insertions, but it is also a problem with batch updates or optimization
passes.

With lightweight IndexReaders, you can check whether the index has been
modified as requests come in, launch a new Searcher if it has, then deal
with the request after a negligible delay. You have to warm the system io
caches when the service starts up ("cat /path/to/index/* > /dev/null"), but
after that, there's no more need for background warming.
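As a rough illustration of the tradeoff above between memory mapped buffers
and buffers slurped into process memory, here is a minimal C sketch. The
names FileView, view_open() and view_close() are hypothetical, invented for
this example; they are not part of Lucy, KS or Lucene, and a real InStream
would look different.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical read-only view of one index file (illustration only). */
typedef struct {
    const unsigned char *data;
    size_t len;
    int is_mapped; /* 1: mmap'd system buffer, 0: malloc'd private copy */
} FileView;

/* Open `path` either "hot" (copied into process memory) or "cold"
 * (memory mapped, shared via the system buffers). Returns 0 on success. */
int view_open(FileView *v, const char *path, int hot) {
    assert(v != NULL && path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    size_t len = (size_t)st.st_size;
    void *buf;
    if (hot) {
        /* Heavy: pays the full read cost up front, once per reader. */
        buf = malloc(len);
        size_t got = 0;
        while (buf != NULL && got < len) {
            ssize_t n = read(fd, (char *)buf + got, len - got);
            if (n <= 0) { free(buf); buf = NULL; }
            else got += (size_t)n;
        }
    } else {
        /* Light: pages are faulted in lazily and may be evicted later. */
        buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) buf = NULL;
    }
    close(fd); /* the mapping, if any, stays valid after close */
    if (buf == NULL) return -1;
    v->data = buf;
    v->len = len;
    v->is_mapped = !hot;
    return 0;
}

/* Release whichever kind of buffer backs the view. */
void view_close(FileView *v) {
    if (v->data == NULL) return;
    if (v->is_mapped) munmap((void *)v->data, v->len);
    else free((void *)v->data);
    v->data = NULL;
    v->len = 0;
}
```

Both kinds of view expose the same bytes to the caller; the difference is
only in who owns the memory and when the IO cost is paid.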
Lightweight IndexReaders can also be sprinkled liberally around source code
in a way that heavy IndexReaders cannot. For instance, each thread in a
multi-threaded server can have its own Searcher.

Launching cheap search processes is also important when writing tools akin
to the Unix command line 'locate' app. The first time you invoke locate it's
slow, but subsequent invocations are nice and quick. You can only mimic that
with a lightweight IndexReader.

And so on... The fact that segment data files are never modified once
written makes the Lucene/Lucy/KS file format design particularly well suited
for memory mapping and sharing via the system buffers. In addition to the
reasons cited above, intuition tells me that this is the right design
decision and that there will be other opportunities not yet anticipated. I
don't see how Lucy can deny such advantages to most users for the sake of
those few for whom term dictionary cache eviction proves to be a problem,
especially when we can offer those users a remedy.

> The biggest problem with the "load important stuff into RAM" approach,
> of course, is we can't actually pin VM pages from java, which means
> the OS will happily swap out my RAM anyway, at which point of course
> we should have used mmap.

We can't realistically pin pages from C, either, at least on Unixen. Modern
Unixen offer the mlock() call, but it has a crucial limitation -- you
typically have to run as root, or stay within a small locked-memory limit
(RLIMIT_MEMLOCK) that is rarely big enough for index data. Also, there
aren't any madvise() flags that hint to the OS that the mapped region should
stay hot. The closest thing is MADV_WILLNEED, which communicates "this will
be needed soon" -- not "keep this around".
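For reference, the two calls mentioned above look roughly like this in use.
This is only a sketch under stated assumptions: map_and_hint() and try_pin()
are hypothetical helper names made up for the example, and whether mlock()
succeeds depends on the privileges and limits of the process that runs it.

```c
#define _DEFAULT_SOURCE /* exposes madvise() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only and hint that it will be
 * needed soon. Returns the mapping (or NULL on failure) and stores its
 * length in *len_out. */
void *map_and_hint(const char *path, size_t *len_out) {
    assert(path != NULL && len_out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return NULL; }
    size_t len = (size_t)st.st_size;
    void *addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (addr == MAP_FAILED) return NULL;
    /* Advisory only: requests read-ahead now, but makes no promise that
     * the pages stay resident afterwards. The result is ignored here. */
    (void)madvise(addr, len, MADV_WILLNEED);
    *len_out = len;
    return addr;
}

/* Hypothetical helper: try to pin the region in RAM. Returns 1 if the
 * OS agreed, 0 if it refused (for example for lack of privilege or
 * because a locked-memory limit would be exceeded). */
int try_pin(void *addr, size_t len) {
    return mlock(addr, len) == 0;
}
```

A caller would have to treat a 0 from try_pin() as the normal case and fall
back to plain mmap behavior, which is why pinning is not something to build
a design on.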

> Further steps towards flexible indexing
> ---------------------------------------
>
>                 Key: LUCENE-1458
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1458
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Index
>    Affects Versions: 2.9
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch, LUCENE-1458.patch
>
>
> I attached a very rough checkpoint of my current patch, to get early
> feedback. All tests pass, though back compat tests don't pass due to
> changes to package-private APIs plus certain bugs in tests that
> happened to work (eg call TermPositions.nextPosition() too many times,
> which the new API asserts against).
> [Aside: I think, when we commit changes to package-private APIs such
> that back-compat tests don't pass, we could go back, make a branch on
> the back-compat tag, commit changes to the tests to use the new
> package private APIs on that branch, then fix nightly build to use the
> tip of that branch?]
> There's still plenty to do before this is committable! This is a
> rather large change:
>   * Switches to a new more efficient terms dict format. This still
>     uses tii/tis files, but the tii only stores term & long offset
>     (not a TermInfo). At seek points, tis encodes term & freq/prox
>     offsets absolutely instead of with deltas. Also, tis/tii
>     are structured by field, so we don't have to record field number
>     in every term.
>     .
>     On first 1 M docs of Wikipedia, tii file is 36% smaller (0.99 MB
>     -> 0.64 MB) and tis file is 9% smaller (75.5 MB -> 68.5 MB).
>     .
>     RAM usage when loading terms dict index is significantly less
>     since we only load an array of offsets and an array of String (no
>     more TermInfo array). It should be faster to init too.
>     .
>     This part is basically done.
>   * Introduces modular reader codec that strongly decouples terms dict
>     from docs/positions readers.
>     EG there is no more TermInfo used
>     when reading the new format.
>     .
>     There's nice symmetry now between reading & writing in the codec
>     chain -- the current docs/prox format is captured in:
> {code}
> FormatPostingsTermsDictWriter/Reader
> FormatPostingsDocsWriter/Reader (.frq file) and
> FormatPostingsPositionsWriter/Reader (.prx file).
> {code}
>     This part is basically done.
>   * Introduces a new "flex" API for iterating through the fields,
>     terms, docs and positions:
> {code}
> FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum
> {code}
>     This replaces TermEnum/Docs/Positions. SegmentReader emulates the
>     old API on top of the new API to keep back-compat.
>
> Next steps:
>   * Plug in new codecs (pulsing, pfor) to exercise the modularity /
>     fix any hidden assumptions.
>   * Expose new API out of IndexReader, deprecate old API but emulate
>     old API on top of new one, switch all core/contrib users to the
>     new API.
>   * Maybe switch to AttributeSources as the base class for TermsEnum,
>     DocsEnum, PostingsEnum -- this would give readers API flexibility
>     (not just index-file-format flexibility). EG if someone wanted
>     to store payload at the term-doc level instead of
>     term-doc-position level, you could just add a new attribute.
>   * Test performance & iterate.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.