Date: Fri, 10 Oct 2008 13:18:46 +0000 (GMT)
From: mark harwood
Subject: Re: Question regarding sorting and memory consumption in lucene
To: java-user@lucene.apache.org
Message-ID: <915178.28087.qm@web26005.mail.ukl.yahoo.com>

Assuming content is added in chronological order and with no updates to existing docs, couldn't you rely on the internal Lucene document id to give a chronological sort order?
That would require no memory cache at all when sorting.

Querying across multiple indexes simultaneously, however, may present an added complication...



----- Original Message ----
From: Aleksander M. Stensby
To: java-user@lucene.apache.org
Sent: Friday, 10 October, 2008 13:51:50
Subject: Re: Question regarding sorting and memory consumption in lucene

I'll follow up on my own question...
Let's say that we have 4 years of data, meaning that there will be roughly
4 * 365 = 1460 unique terms for our sort field.
For one index, let's say with 30 million docs, the cache should use approx
100 MB, or am I wrong? And thus for 6 indexes we would need approx 600 MB
for the caches? (And an additional 100 MB every time we warm a new searcher
and swap it out...) As far as the string versus int or long goes, I don't
really see any big gain in changing it, since 1460 * 10 bytes of extra memory
doesn't really make much difference.
Or?

I guess we should consider reducing the index size or at least only allow
sorted search on a subset of the index (or a pruned version of the
index...)? Would that be better for us?
But then again, I assume that there are much larger lucene-based indexes
out there than ours, and you guys must have some solution to this issue,
right? :)

best regards,
 Aleksander


On Fri, 10 Oct 2008 14:09:36 +0200, Aleksander M. Stensby wrote:

> Hello, I've read a lot of threads now on memory consumption and sorting,
> and I think I have a pretty good understanding of how things work, but I
> could still use some input here..
>
> We currently have a system consisting of 6 different lucene indexes (all
> have the same structure, so you could say it is a form of sharding). We
> currently use this approach because we want to be able to give users
> access to different indexes (but not necessarily all indexes) etc.
>
> (We are planning to move to a solr-based system, but for now we would
> like to solve this issue with our current lucene-based system.)
>
> The thing is, the indexes are rather big (ranging from 5G to 20G per
> index and 10 - 30 million entries per index.)
> We keep one searcher object open per index, and when the index is
> changed (new documents added in batches several times a day), we update
> the searcher objects.
> In the warmup procedure we did a couple of searches and things work
> fine, BUT I realized that in our application we return hits sorted by
> date by default, and our warmup procedure did non-sorted queries... so
> the first searches done by the user after an update were still slow
> (obviously).
>
> To cope, I changed the warmup procedure to include a sorted search, and
> now the user will not notice slow queries.
> Good!
> But the problem at hand is that we are running into memory problems
> (and I understand that sorting does consume a lot of memory...). Is
> there any "best practice" way to deal with this? The field we
> sort on is an un_indexed text field representing the date, typically
> "2008-10-10". I am aware that string field sorting consumes a lot of
> memory, so should we change this field to something different? Would
> this help us with the memory problems?
>
> As a sidenote / curiosity question: Does it matter if we use the search
> method returning Hits versus the search method returning TopFieldDocs?
> (we are not iterating them in any way when this memory issue occurs)
>
> Thanks in advance for any guidance we may get.
>
> Best regards,
> Aleksander M. Stensby
>
>
>



--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S
+47 41 22 82 72
aleksander.stensby@integrasco.no

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
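
The cache estimate discussed in this thread (roughly 100 MB per 30 million docs, with the 1460 date strings being negligible) can be checked with a little arithmetic. The sketch below is only a back-of-the-envelope model, not Lucene code: it assumes the string sort cache holds one 4-byte int ordinal per document plus one String per unique term, and the 40-byte per-String overhead is a guessed constant. The class and method names are hypothetical.

```java
public class SortCacheEstimate {
    // Assumption: the sort cache keeps one int ordinal per document.
    static final long BYTES_PER_ORDINAL = 4L;
    // Assumption: guessed fixed overhead per String (object and char[] headers).
    static final long ASSUMED_STRING_OVERHEAD = 40L;

    // Bytes used by the per-document ordinal array.
    static long ordinalBytes(long numDocs) {
        return numDocs * BYTES_PER_ORDINAL;
    }

    // Bytes used by the unique term strings (2 bytes per char plus overhead).
    static long termBytes(long uniqueTerms, int charsPerTerm) {
        return uniqueTerms * (ASSUMED_STRING_OVERHEAD + 2L * charsPerTerm);
    }

    // Total estimated bytes for one index's sort cache.
    static long totalBytes(long numDocs, long uniqueTerms, int charsPerTerm) {
        return ordinalBytes(numDocs) + termBytes(uniqueTerms, charsPerTerm);
    }

    public static void main(String[] args) {
        long docs = 30_000_000L;           // largest index mentioned in the thread
        long terms = 4L * 365L;            // four years of daily dates
        int chars = "2008-10-10".length(); // 10 characters per date term
        long perIndex = totalBytes(docs, terms, chars);
        System.out.println("ordinals per index: " + ordinalBytes(docs) + " bytes");
        System.out.println("term strings per index: " + termBytes(terms, chars) + " bytes");
        System.out.println("six indexes (worst case): " + 6 * perIndex + " bytes");
    }
}
```

Under these assumptions the per-document ordinals dominate and the term strings barely register, which agrees with the point above that switching the field from string to int or long would not change much; the cost scales with the number of documents, not the number of unique dates.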