From: Jason Rutherglen
To: java-user@lucene.apache.org, java-dev@lucene.apache.org
Date: Wed, 10 Jun 2009 13:13:55 -0700
Subject: Re: Lucene memory usage

Great! If I understand correctly, the main win here is RAM savings? Will
there also be an improvement in lookup speed? (We're using binary search
here?)

Is there a precedent in database systems for what was mentioned about
placing the term dict, delDocs, and filters on disk and reading them
from there, with the IO cache taking care of keeping the data in RAM?
(Would there be a future advantage to this approach when SSDs are more
prevalent?) It seems like we could have some generalized pluggable
system where one could try out either this or the current heap approach,
and benchmark the two.

Given our continued inability to properly measure Java RAM usage, this
approach may be a good one for Lucene? Heap-based LRU caches are a shot
in the dark when it comes to memory size, as we never really know how
much they're using.

Once we generalize delDocs, filters, and field caches (LUCENE-831?),
perhaps CSF is a good place to test out this approach? We could have a
generic class that handles the underlying IO and simply returns values
based on a position or iteration.

On Wed, Jun 10, 2009 at 11:26 AM, Michael McCandless <
lucene@mikemccandless.com> wrote:

> Roughly, the current approach for the default terms dict codec in
> LUCENE-1458 is:
>
>   * Create a separate class per-field (the String field in each Term
>     is redundant).  This is a big change over Lucene today....
>
>   * That class has String[] indexText and long[] indexPointer, each
>     length = the number of index terms.  No TermInfo or Term
>     instances are used.
>
>   * Modify the tis format to also store its data by field
>
>   * Modify the tis format so that at a seek point (ie an indexed
>     term), absolute values are written for freq/prox pointer, but
>     continue to delta-code in between indexed terms.  EG this is how
>     video codecs work (every so often they write a "key frame" which
>     you can seek to & immediately decode w/ no prior context).
>
>   * tii then just stores text/long (delta coded) for all indexed
>     terms, and is slurped into the arrays on init.
>
> This is a sizable RAM savings over what's done now because you save 2
> objects, 3 pointers, 2 longs, 2 ints (I think), per indexed term.
>
> Mike
>
> On Wed, Jun 10, 2009 at 2:02 PM, Jason Rutherglen wrote:
> >> LUCENE-1458 (flexible indexing) has these improvements,
> >
> > Mike, can you explain how it's different?  I looked through the code once
> > but yeah, it's in with a lot of other changes.
> >
> > On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless <
> > lucene@mikemccandless.com> wrote:
> >
> >> This (very large number of unique terms) is a problem for Lucene
> >> currently.
> >>
> >> There are some simple improvements we could make to the terms dict
> >> format to not require so much RAM per term in the terms index...
> >> LUCENE-1458 (flexible indexing) has these improvements, but
> >> unfortunately tied in w/ lots of other changes.  Maybe we should break
> >> out a separate issue for this... this'd be a great contained
> >> improvement, if anyone out there has "the itch" :)
> >>
> >> One simple workaround is to call IndexReader.setTermInfosIndexDivisor
> >> immediately after opening the reader; this simply loads fewer terms in
> >> the index, using far less RAM, but at the expense of somewhat slower
> >> searching.
> >>
> >> Also: you should peek at your index, eg using Luke, to understand why
> >> you have so many terms.  It could be legitimate (indexing a massive
> >> catalog with eg part numbers), or it could be that your document
> >> filtering / analyzer is accidentally producing garbage terms.
> >>
> >> Mike
> >>
> >> On Wed, Jun 10, 2009 at 8:23 AM, Benedikt Boss wrote:
> >> > Hej hej,
> >> >
> >> > I have a question regarding Lucene's memory usage
> >> > when launching a query. When I execute my query,
> >> > Lucene eats up over 1 GB of heap memory even
> >> > when my result set is only a single hit. I
> >> > found out that this is due to the "ensureIndexIsRead()"
> >> > method call in the "TermInfosReader" class, which
> >> > iterates over all Terms found in the index and saves
> >> > them (including all value strings) in a Term array.
> >> > Is it possible to not read all that stuff
> >> > into memory at all?
> >> >
> >> > I'm doing the query like in the following pseudo-code:
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > TopScoreDocCollector collector = new TopScoreDocCollector(100000);
> >> >
> >> > QueryParser parser = new QueryParser(field, new WhitespaceAnalyzer());
> >> > Directory fsDir = new FSDirectory(indexDir, null);
> >> > IndexSearcher is = new IndexSearcher(fsDir);
> >> >
> >> > Query query = parser.parse(q);
> >> >
> >> > is.search(query, collector);
> >> > // topDocs() returns a TopDocs; its scoreDocs field holds the hits
> >> > ScoreDoc[] hits = collector.topDocs().scoreDocs;
> >> >
> >> > ....... < iterate over hits and print results >
> >> >
> >> >
> >> > Thanks in advance
> >> > Benedikt
> >> >
> >> > ---------------------------------------------------------------------
> >> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> > For additional commands, e-mail: java-user-help@lucene.apache.org
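
For readers following the thread, here is a minimal sketch of the
per-field terms index Mike describes above: two parallel arrays
(String[] indexText and long[] indexPointer) searched with a binary
search. Only those two array names come from the thread. The class name
FieldTermsIndex and the methods seekIndex and seekPointer are
hypothetical, and this is not the LUCENE-1458 code. It assumes the
indexed terms are sorted in String.compareTo order and that the first
term of the field is always an indexed term.

```java
// Illustrative sketch only; names other than indexText/indexPointer are hypothetical.
final class FieldTermsIndex {
    // Text of every indexed term for ONE field, in sorted order.
    private final String[] indexText;
    // Offset in the terms file where each indexed term's entry starts.
    private final long[] indexPointer;

    FieldTermsIndex(String[] indexText, long[] indexPointer) {
        if (indexText.length != indexPointer.length) {
            throw new IllegalArgumentException("arrays must have equal length");
        }
        this.indexText = indexText;
        this.indexPointer = indexPointer;
    }

    /**
     * Binary search for the last indexed term that is <= target.
     * Returns -1 if target sorts before every indexed term.
     */
    int seekIndex(String target) {
        int lo = 0;
        int hi = indexText.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = indexText[mid].compareTo(target);
            if (cmp == 0) {
                return mid;
            } else if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        // Loop ended without an exact match: hi is the last entry < target, or -1.
        return hi;
    }

    /**
     * Offset to seek to before scanning forward for target,
     * or -1 if target cannot be in this field.
     */
    long seekPointer(String target) {
        int i = seekIndex(target);
        return i < 0 ? -1L : indexPointer[i];
    }
}
```

With indexed terms {"apple", "mango", "pear"}, a lookup for "nut" lands
on the "mango" entry, and a reader would then scan forward from that
offset through the delta-coded terms. The per-term cost is one String
reference and one long, with no Term or TermInfo object per indexed
term, which is where the RAM savings in the thread come from.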