Date: Wed, 25 Apr 2007 23:27:10 +0400
From: Artem
To: Ivan Vasilev
Reply-To: java-user@lucene.apache.org
Subject: Re[2]: Out of memory exception for big indexes
Message-ID: <131828200.20070425232710@gmail.com>
In-Reply-To: <462F34CC.4050609@sirma.bg>

Hello Ivan,

That was cool news! Thanks! :)
The timings are surprisingly good: 10 million docs sorted in about 20 s.
Cool! It also looks like the sorting algorithm employed by Lucene is
quite memory-efficient.

Not supporting multiple fields is in fact another limitation of my
patch. I don't need it, so I didn't implement it :) Implementing it
probably means doing it manually: use a FieldSelector that fetches the
whole bunch of fields, and change the
compare(ScoreDoc scoreDoc1, ScoreDoc scoreDoc2) method so that it
compares docs by that bunch of fields instead of a single field (there
would also have to be an array of Asc/Desc flags somewhere, which makes
this more complicated). That's it.
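
A minimal sketch of the comparison idea described above, in plain Java with no Lucene classes. Every name in it (MultiFieldCompareSketch, compareRows, the String[] rows) is hypothetical and not part of the patch. It assumes each document's sort fields have already been fetched in one read into a String[], and that all values are compared as strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: these names are illustrative only and do not
// come from Lucene or from the StoredFieldSortFactory patch.
class MultiFieldCompareSketch {

    /**
     * Compares two documents by several fields in order.
     * a[i] and b[i] hold the value of sort field i for each document;
     * descending[i] is the Asc/Desc flag for that field.
     */
    static int compareRows(String[] a, String[] b, boolean[] descending) {
        for (int i = 0; i < descending.length; i++) {
            // Swap the operands for a descending field instead of negating
            // the result, so the sign is always well defined.
            int c = descending[i] ? b[i].compareTo(a[i]) : a[i].compareTo(b[i]);
            if (c != 0) {
                return c; // the first field that differs decides the order
            }
        }
        return 0; // all sort fields are equal
    }

    /** Wraps compareRows in a Comparator for use with List.sort. */
    static Comparator<String[]> comparator(boolean[] descending) {
        return (a, b) -> compareRows(a, b, descending);
    }

    public static void main(String[] args) {
        // Each row stands for the stored fields of one document,
        // here: file name and absolute path.
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"b.txt", "/home/x"});
        rows.add(new String[] {"a.txt", "/home/x"});
        rows.add(new String[] {"a.txt", "/home/y"});

        // File name ascending, then path descending.
        rows.sort(comparator(new boolean[] {false, true}));
        for (String[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

In a real implementation the rows would come from the index rather than from memory, and numeric fields such as file size would need their own comparison instead of String.compareTo.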

I don't yet understand why Sort(SortField[] fields) didn't give the same
result when fields.length == 1. We should probably dig into the Lucene
code to find out. In the case of several fields I can imagine why this
approach would be less effective: at least N*2 Document reads (by
StoredFieldComparator.sortValue) are needed to compare 2 documents (N is
the length of the fields array). One read with an appropriate
FieldSelector is likely to perform better.

Anyway, I do think StoredFieldSortFactory's approach could be
successfully applied to multiple fields, but I'm not going to implement
it yet. Maybe you? :)

Regards,
Artem

IV> Hi Artem,
IV>
IV> Thank you very much for your mails :)
IV> So first I have to tell you that your patch works perfectly even with
IV> very big indexes - 40 GB (you can see the results below).
IV> The reason I had bad test results last time is that I made a small
IV> change (but I cannot understand why this change caused a problem; in
IV> my opinion it should not have such a big effect on performance).
IV> The change I made is this: I added a new method to the class
IV> StoredFieldSortFactory. It is the same as the create(String
IV> sortFieldName, boolean sortDescending) method, but instead of wrapping
IV> the SortField it returns it directly, and in my class I wrap this
IV> object in a Sort.
IV> Here is the code:
IV>
IV> public static SortField createSortField(String sortFieldName,
IV>         boolean sortDescending) {
IV>     return new SortField(sortFieldName, instance, sortDescending);
IV> }
IV>
IV> I do this because we have to support sorting on multiple fields, so I
IV> obtain all the SortField objects in a loop and then create a Sort out
IV> of them:
IV>
IV> Sort sort = new Sort(sortFields);
IV>
IV> In my tests that had very bad results (searches took more than 5
IV> mins), all the tests used sorting ONLY BY ONE FIELD (meaning the array
IV> sortFields always had length 1).
IV> But I still used the constructor Sort(SortField[]) rather than
IV> Sort(SortField), which your code originally uses in the method
IV> StoredFieldSortFactory.create(..).
IV> Do you think this is the reason for the poor performance?
IV> If so, COULD YOU PLEASE TELL ME how to use your patch for sorting on
IV> multiple stored fields?
IV>
IV> Here are the test results of your patch with different indexes (the
IV> tests use the code just as you recommend, with your create(..) method
IV> that uses the constructor Sort(SortField)):
IV>
IV> - CPU - Intel Core2Duo, max memory allowed to the process that does
IV>   the searching - 1 GB (not all of it used)
IV> **********************************************************************
IV> - index size 3.3 GB, about 486 410 documents (all the test searches
IV>   include all documents);
IV> ______________________________________________________________________
IV> - field: file name; length varies, in my opinion 15 - 30 chars on
IV>   average.
IV>   - search time (ASC) - 1.312 s, memory usage - 71 MB
IV>   - search time (DSC) - 1.281 s, memory usage - 71 MB
IV> - field: abs path name; length varies, in my opinion 60 - 90 chars on
IV>   average.
IV>   - search time (ASC) - 1.344 s, memory usage - 71 MB
IV>   - search time (DSC) - 1.328 s, memory usage - 71 MB
IV> - field: file size; length varies, in my opinion 3 - 7 chars on
IV>   average.
IV>   - search time (ASC) - 1.313 s, memory usage - 71 MB
IV>   - search time (DSC) - 1.312 s, memory usage - 71 MB
IV> **********************************************************************
IV> - index size 21.4 GB, about 376 999 documents (all the test searches
IV>   include all documents);
IV> ______________________________________________________________________
IV> - field: file name; length varies, in my opinion 15 - 30 chars on
IV>   average.
IV>   - search time (ASC) - 0.875 s, memory usage - 371 MB
IV>   - search time (DSC) - 0.828 s, memory usage - 371 MB
IV> - field: abs path name; length varies, in my opinion 60 - 90 chars on
IV>   average.
IV>   - search time (ASC) - 0.844 s, memory usage - 371 MB
IV>   - search time (DSC) - 0.813 s, memory usage - 371 MB
IV> - field: file size; length varies, in my opinion 3 - 7 chars on
IV>   average.
IV>   - search time (ASC) - 0.813 s, memory usage - 371 MB
IV>   - search time (DSC) - 0.797 s, memory usage - 371 MB
IV> **********************************************************************
IV> - index size 42.9 GB, about 10 944 918 documents (all the test
IV>   searches include all documents);
IV> ______________________________________________________________________
IV> - field: file name; length varies, in my opinion 15 - 30 chars on
IV>   average.
IV>   - search time (ASC) - 21.905 s, memory usage - 625 MB
IV>   - search time (DSC) - 21.781 s, memory usage - 625 MB
IV> - field: abs path name; length varies, in my opinion 60 - 90 chars on
IV>   average.
IV>   - search time (ASC) - 21.874 s, memory usage - 625 MB
IV>   - search time (DSC) - 21.749 s, memory usage - 625 MB
IV> - field: file size; length varies, in my opinion 3 - 7 chars on
IV>   average.
IV>   - search time (ASC) - 21.687 s, memory usage - 625 MB
IV>   - search time (DSC) - 21.812 s, memory usage - 625 MB
IV>
IV> THANK YOU VERY MUCH,
IV> Ivan

-- 
Best regards,
Artem
mailto:abublic@gmail.com