Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 17866 invoked from network); 10 Oct 2008 14:59:17 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 10 Oct 2008 14:59:17 -0000 Received: (qmail 73163 invoked by uid 500); 10 Oct 2008 14:59:10 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 72891 invoked by uid 500); 10 Oct 2008 14:59:09 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 72880 invoked by uid 99); 10 Oct 2008 14:59:09 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Oct 2008 07:59:09 -0700 X-ASF-Spam-Status: No, hits=0.3 required=10.0 tests=DNS_FROM_SECURITYSAGE,RCVD_IN_DNSWL_LOW,SPF_NEUTRAL X-Spam-Check-By: apache.org Received-SPF: neutral (athena.apache.org: local policy) Received: from [62.73.241.54] (HELO pebbles.fastcom.no) (62.73.241.54) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 10 Oct 2008 14:58:05 +0000 Received: from melkor (integrasco.bzzware.org [213.236.150.62]) by pebbles.fastcom.no (Postfix) with ESMTP id 009CF6060 for ; Fri, 10 Oct 2008 16:58:05 +0200 (DFT) Date: Fri, 10 Oct 2008 16:58:14 +0200 To: java-user@lucene.apache.org Subject: Re: Question regarding sorting and memory consumption in lucene From: "Aleksander M. Stensby" Organization: Integrasco A/S Content-Type: text/plain; format=flowed; delsp=yes; charset=utf-8 MIME-Version: 1.0 References: <306074.50126.qm@web26004.mail.ukl.yahoo.com> Content-Transfer-Encoding: 7bit Message-ID: In-Reply-To: <306074.50126.qm@web26004.mail.ukl.yahoo.com> User-Agent: Opera Mail/9.60 (Linux) X-Virus-Checked: Checked by ClamAV on apache.org Yes, I understand that, and I did mean the number of documents, but I read in the javadoc that: "For String fields, the cache is larger: in addition to the above array, the value of every term in the field is kept in memory. If there are many unique terms in the field, this could be quite large." and in one of the mails on the mailinglist i read: "So if your field is an Int your talking numDocs*32 bits for your cache. For a Long field its numDocs*64. For a String field Lucene caches a String array with every unique term and then an int array indexing into the term array." But yes, I understand that the memory drain should not be significant due to the String field in our case since, but the fact that we have so many documents. Still, if we have 30 000 000 docs * 10 bytes = ~300 mb (ok I was a bit off there... :p) If we had a long field instead, it would be about 220 mb or something like that? hmm.. But does that mean that I will just have to reduce my index (at least for the sorting) ? No other option? Cheers, and thanks for your help! - Aleks On Fri, 10 Oct 2008 16:43:35 +0200, mark harwood wrote: > I think you have your memory cost calculation wrong. > The cost is field size (10 bytes ?) times number of documents NOT number > of unique terms. > The cache is essentially an array of size reader.maxDoc() which is > indexed directly into on docId to retrieve field values. > > You are right in needing to factor in the cost of keeping one active > cache while busy warming-up a new one so that effectively doubles the > RAM requirements. > > > > > > > ----- Original Message ---- > From: Aleksander M. Stensby > To: java-user@lucene.apache.org > Sent: Friday, 10 October, 2008 15:25:29 > Subject: Re: Question regarding sorting and memory consumption in lucene > > Unfortunately no, since the documents that are added may come form a new > "source" containing old documents aswell..:/ > I tried deploying our webapplication without any searcher objects and it > consumes basically ~200mb of memory in tomcat. > With 6 searchers the same applications manages to consume over 2.5 GB of > memory when warming... :( > I might have done some super-idiotic logic in the way I handle searching, > but I can seriously not see what that might be... > > But I assume that people deal with much larger indexes than this, right? > > cheers, > Aleksander > > > On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood > > wrote: > >> Assuming content is added in chronological order and with no updates to >> existing docs couldn't you rely on internal Lucene document id to give a >> chronological sort order? >> That would require no memory cache at all when sorting. >> >> Querying across multiple indexes simultaneously however may present an >> added complication... >> >> >> >> ----- Original Message ---- >> From: Aleksander M. Stensby >> To: java-user@lucene.apache.org >> Sent: Friday, 10 October, 2008 13:51:50 >> Subject: Re: Question regarding sorting and memory consumption in lucene >> >> I'll follow up on my own question... >> Let's say that we have 4 years of data, meaning that there will be >> roughly >> 4 * 365 = 1460 unique terms for our sort field. >> For one index, lets say with 30 million docs, the cache should use >> approx >> 100mb, or am I wrong? and thus for 6 indexes we would need approx 600 mb >> for the caches? (and an additional 100mb every time we warm a new >> searcher >> and swap it out...) As far as the string versus int or long goes, I >> don't >> really see any big gain in changig it since 1460 * 10 bytes extra >> memory >> doesnt really make much difference. Or? >> >> I guess we should consider reducing the index size or at least only >> allow >> sorted search on a subset of the index (or a pruned version of the >> index...) ? Would that be better for us? >> But then again, I assume that there are much larger lucene-based indexes >> out there than ours, and you guys must have some solution to this issue, >> right? :) >> >> best regards, >> Aleksander >> >> >> On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby >> wrote: >> >>> Hello, I've read a lot of threads now on memory consumption and >>> sorting, >>> and I think I have a pretty good understanding of how things work, but >>> I >>> could still need some input here.. >>> >>> We currently have a system consisting of 6 different lucene indexes >>> (all >>> have the same structure, so you could say it is a form of sharding). We >>> currently use this approach because we want to be able to give users >>> access to different index (but not necessarily all indexes) etc. >>> >>> (We are planning to move to a solr-based system, but for now we would >>> like to solve this issue with our current lucene-based system.) >>> >>> The thing is, the indexes are rather big (ranging from 5G to 20G per >>> index and 10 - 30 million entries per index.) >>> We keep one searcher object open per index, and when the index is >>> changed (new documents added in batches several times a day), we update >>> the searcher objects. >>> In the warmup procedure we did a couple of searches and things work >>> fine, BUT i realized that in our application we return hits sorted by >>> date by default, and our warmup procedure did non-sorted queries... so >>> still the first searches done by the user after an update was slow >>> (obviously). >>> >>> To cope, I changed the warmup procedure to include a sorted search, and >>> now the user will not notice slow queries. Good! >>> But, the problem at hand is that we are running into memory problems >>> (and I understand that sorting does consume a lot of memory...) But is >>> there any way that is "best practice" to deal with this? The field we >>> sort on is an un_indexed text field representing the date. typically >>> "2008-10-10". I am aware that string field sorting consumes a lot of >>> memory, so should we change this field to something different? Would >>> this help us with the memory problems? >>> >>> As a sidenote / couriosity question: Does it matter if we use the >>> search >>> method returning Hits versus the search method returning TopFieldDocs? >>> (we are not iterating them in any way when this memory issue occurs) >>> >>> Thanks in advance for any guidance we may get. >>> >>> Best regards, >>> Aleksander M. Stensby >>> >>> >>> >> >> >> > > > -- Aleksander M. Stensby Senior Software Developer Integrasco A/S +47 41 22 82 72 aleksander.stensby@integrasco.no --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org