From: "Erick Erickson"
To: java-user@lucene.apache.org
Date: Thu, 4 Dec 2008 16:12:59 -0500
Subject: Re: Slow queries with lots of hits
Message-ID: <359a92830812041312l52d6a8bet5a3032e7dbcbb4c8@mail.gmail.com>
References: <359a92830812041239i56d3d22m77e4c7370c1a45e6@mail.gmail.com>

Huh? TopDocCollector isn't biased unless you suppose that you'll have
many documents scoring *exactly* the same. You collect the top N
scoring documents.

Actually, I think this is all pretty much done for you with the
Searcher.search(Query query, Filter filter, int n) method. You can
pass null for Filter.......

Best
Erick

On Thu, Dec 4, 2008 at 3:52 PM, Tim Sturge wrote:

> That makes sense. I should be more precise in that all I need is 100 of the
> 10000 "reasonable" results.
>
> The concern I would have with a TopDocCollector is that this is biased
> towards the top of the index which translates for me into a bias for older
> documents. I'd prefer no age bias or a newer document bias. So I'll see what
> I can do with a "BottomDocCollector" :-)
>
> Tim
>
>
> On 12/4/08 12:39 PM, "Erick Erickson" wrote:
>
> > The problem here is how *could* a system return even the top
> > 10,000 results without scoring them all? What if the millionth
> > hit resulted in the very best match in the entire corpus?
> >
> > That said, sorting may well be the issue here rather than scoring.
> > You can use a TopDocCollector to get the top N matches (unsorted)
> > and then do something like use the FieldSortedHitQueue to sort
> > those N matches, leaving out all the rest of the matches. Note
> > this assumes that when you say "sorting" you mean sorting
> > by something other than relevance.....
> >
> > Hope this helps
> > Erick
> >
> > On Thu, Dec 4, 2008 at 3:27 PM, Tim Sturge wrote:
> >
> >> Hi all,
> >>
> >> I have an interesting problem with my query traffic. Most of the queries
> >> run in a fairly short amount of time (< 100ms) but a few take over 1000ms.
> >> These queries are predominantly those with a huge number of hits
> >> (>1 million hits in a >100 million document index). The time taken (as far
> >> as I can tell) is for lucene to sit there while it scores and sorts all
> >> these results.
> >>
> >> However it turns out these queries really don't have top results. That is,
> >> of the million documents, there are easily 10000 which are decent results
> >> (basically those above some threshold score). Frankly, just returning some
> >> consistent (so paging and reload work) but otherwise arbitrary ranking of
> >> these 10000 results would be more than good enough.
> >>
> >> It seems to me that a solution would be to impose some sort of
> >> pseudo-random filter (e.g. consider only every n-th document assuming they
> >> are uniformly distributed). I'm wondering if anyone else has experience
> >> with this sort of issue and what solutions they have found to work well in
> >> practice.
> >>
> >> Thanks,
> >>
> >> Tim
> >>
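
For readers following the thread, here is a minimal sketch of the "collect the top N scoring documents" idea discussed above. It is not Lucene code and makes no claim about how TopDocCollector is implemented: Hit, TopNSketch, topN and WORST_FIRST are hypothetical names, scores are passed in as a plain array indexed by document ID, and the tie-breaking rule (on exactly equal scores the lower document ID is kept) is an assumption chosen to illustrate the bias Tim is worried about. Under that assumption, document order only matters when scores tie exactly, which is the point Erick makes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical (docId, score) pair; a stand-in, not a Lucene class. */
final class Hit {
    final int doc;
    final float score;

    Hit(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }
}

final class TopNSketch {
    // Orders hits from worst to best. A lower score is worse; on exactly
    // equal scores the higher doc ID is treated as worse (assumed rule).
    static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    /** Returns at most n hits, best first. Every score is still visited once. */
    static List<Hit> topN(float[] scores, int n) {
        // Min-heap on WORST_FIRST: the head is the weakest hit kept so far.
        PriorityQueue<Hit> heap = new PriorityQueue<>(Math.max(1, n), WORST_FIRST);
        for (int doc = 0; doc < scores.length; doc++) {
            if (scores[doc] <= 0f) {
                continue; // assumption: non-positive score means "no match"
            }
            Hit hit = new Hit(doc, scores[doc]);
            if (heap.size() < n) {
                heap.add(hit);
            } else if (n > 0 && WORST_FIRST.compare(hit, heap.peek()) > 0) {
                heap.poll(); // evict the current weakest hit
                heap.add(hit);
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort(WORST_FIRST.reversed()); // only the n survivors get sorted
        return best;
    }

    public static void main(String[] args) {
        float[] scores = {0.5f, 0.9f, 0.1f, 0.9f, 0.7f};
        for (Hit hit : topN(scores, 3)) {
            System.out.println("doc=" + hit.doc + " score=" + hit.score);
        }
    }
}
```

The heap never holds more than n entries, so the final sort touches only the n survivors rather than every hit. The scan itself still visits every matching document, which is the cost Erick points out cannot be avoided if the best match might be the millionth hit.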