Message-ID: <4519AAF9.7050605@yahoo.co.uk>
Date: Tue, 26 Sep 2006 23:34:33 +0100
From: markharw00d
To: java-user@lucene.apache.org
Subject: Re: how to get results without getting total number of found documents?

>> - get the top 1000 results WITHOUT executing query across whole data set

(Apologies if this is telling you something you are already fully aware of.)

Counting matches doesn't involve scanning the text of all the docs, so it may be less expensive than you think for a single index. Lucene very quickly looks up and ranks only the docs containing your search terms, so a total match count is a cheap by-product of that operation. See a description of inverted indexes for more details: http://en.wikipedia.org/wiki/Inverted_index

If you're aware of all that and are considering larger-scale problems (billions of docs) where multiple machines/indexes must be queried in parallel, things are more complex. The cost of combining result scores from multiple machines is typically why you can't page beyond 1000 results. Some of these large distributed architectures divide content into popular/recent content and older/less popular content. Approximations for the total number of matching docs are calculated from queries executed solely on the popular subset.
Only queries with insufficient matches in the popular content resort to querying the older stuff.

Cheers
Mark

Vladimir Olenin wrote:
> Hi.
>
> I couldn't find the answer to this question in the mailing list archive.
> In case I missed it, please let me know the keyword phrase I should be
> looking for, if not a direct link.
>
> All the Lucene-powered implementations I've seen (well, primarily those
> utilizing Solr) return an exact count of the number of documents found.
> That means the query is resolved across the whole data set in a precise
> fashion. If the number of searched documents is huge (e.g., > 1 billion),
> this should present quite a problem. I wonder if that's the default
> behaviour of Lucene, or rather of the frameworks that utilize it? Is it
> possible to:
>
> - get the top 1000 results WITHOUT executing the query across the whole
>   data set - in other words, can Lucene:
>   - chunk out the top X results by an 'approximate' fast search, which
>     returns an _approximate_ total number of found documents, similar to
>     Google's total pages found count
>   - and perform a more accurate search within that chunk
>
> Is such functionality built in, or does it have to be customized? If it's
> built in, what algorithms are used to 'chunk out' the results and get the
> approximate docs count? What classes should I look at?
>
> Thanks!
>
> Vlad
>
> PS: it's pretty much the functionality Google has - you can't get more
> than 1000 matches per query (meaning, you can see even '10M' documents
> found, but if you try to browse beyond the first 1000 results, you'll
> get an error page).
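[Editor's illustration of the inverted-index point in Mark's reply: a minimal, self-contained sketch in plain Java, not Lucene's actual API. Postings lists map each term to the docs containing it, so the total hit count falls out of the same lookup that produces the matching docs - no document text is ever scanned.]

```java
import java.util.*;

public class InvertedIndexDemo {
    // term -> list of doc IDs containing it (a postings list)
    static Map<String, List<Integer>> index = new HashMap<>();

    static void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            index.computeIfAbsent(term, k -> new ArrayList<>()).add(docId);
        }
    }

    // AND query: intersect the postings lists of the query terms.
    // The total match count is simply the size of the intersection.
    static List<Integer> search(String... terms) {
        List<Integer> result = null;
        for (String term : terms) {
            List<Integer> postings =
                index.getOrDefault(term, Collections.emptyList());
            if (result == null) {
                result = new ArrayList<>(postings);
            } else {
                result.retainAll(postings);
            }
        }
        return result == null ? Collections.<Integer>emptyList() : result;
    }

    public static void main(String[] args) {
        add(0, "lucene is a search library");
        add(1, "inverted index search is fast");
        add(2, "lucene uses an inverted index");

        List<Integer> hits = search("lucene", "index");
        System.out.println("total hits: " + hits.size()); // prints "total hits: 1"
        System.out.println("docs: " + hits);              // prints "docs: [2]"
    }
}
```

Real Lucene postings are compressed and sorted on disk, but the principle is the same: the count of matching docs is known as soon as the postings have been walked, which is why an exact total is cheap on a single index.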