Message-ID: <359a92830706150918g27e02e70h9bd72344f541c39f@mail.gmail.com>
Date: Fri, 15 Jun 2007 12:18:18 -0400
From: "Erick Erickson"
To: java-user@lucene.apache.org
Subject: Re: Several questions about scoring/sorting + random sorting in an image/related application
In-Reply-To: <8E10FB04-3084-4769-952F-1C1B8A96BA93@taktik.be>

Another possibility is to re-think this a bit. You are "displaying documents
one page at a time", which I take to mean you are displaying some number
(say 50) of document summaries per page. I'm also assuming that you want to
display ALL documents from, say, collection 32 and then (and only then)
display the documents in the next-ranking collection. Let's further assume,
for discussion purposes, that no collection has fewer than 50 docs, but
that's not a requirement.

At server startup, you compute the number of documents in each bucket
(that is, in each collection). When a user pages through results, you can
reasonably confine the search to just two or three of your collections and
easily add a boosted clause for those few collections.
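
To make the bookkeeping concrete, here is a rough sketch in plain Java (no Lucene calls) of how per-collection counts could be turned into the set of collections a given page touches. The class and method names (CollectionPager, Slice, slicesForPage) are made up for illustration, and the sketch assumes the counts reflect the documents that actually match the current query, listed in ranking order, best collection first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps a page of results onto the collections it touches. */
class CollectionPager {

    /** One piece of a page: which collection, where to start inside it, how many docs to take. */
    static final class Slice {
        final int collectionId;
        final int offset;
        final int count;

        Slice(int collectionId, int offset, int count) {
            this.collectionId = collectionId;
            this.offset = offset;
            this.count = count;
        }
    }

    /**
     * @param docCounts collection ID -> number of docs, iterated in ranking order (assumption)
     * @param start     zero-based index of the first result on the page
     * @param pageSize  number of results per page
     */
    static List<Slice> slicesForPage(LinkedHashMap<Integer, Integer> docCounts,
                                     int start, int pageSize) {
        List<Slice> slices = new ArrayList<>();
        int skipped = 0;          // docs in collections ranked before the current one
        int remaining = pageSize; // docs still needed to fill the page
        for (Map.Entry<Integer, Integer> e : docCounts.entrySet()) {
            if (remaining == 0) {
                break;
            }
            int size = e.getValue();
            if (start < skipped + size) {
                // The page overlaps this collection.
                int offset = Math.max(0, start - skipped);
                int count = Math.min(remaining, size - offset);
                slices.add(new Slice(e.getKey(), offset, count));
                remaining -= count;
            }
            skipped += size;
        }
        return slices;
    }
}
```

Only the collections that appear in the returned slices would then need a clause in the query for that page; everything ranked before or after them can be left out.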

The tricky part of this would be knowing that a request for results
200-250 spanned collections 100 and 400, and making sure that the
page boundaries were computed correctly.

Don't know if this is reasonable, but it occurred to me...

Best
Erick

On 6/15/07, Antoine Baudoux wrote:
>
> The problem is that I want Lucene to do the sorting, because the
> query would return thousands of results, and I'm displaying documents
> one page at a time.
> --
> Antoine Baudoux
> Development Manager
> ab@taktik.be
> Tel.: +32 2 333 58 44
> GSM: +32 499 534 538
> Fax.: +32 2 648 16 53
>
>
> On 15 Jun 2007, at 17:42, Mathieu Lecarme wrote:
>
> > First step is to feed a Set with "collection".
> > Second step is to sort it.
> >
> > With a SortedSet, you can do that, can't you?
> >
> > M.
> >
> >
> > Antoine Baudoux wrote:
> >> Could you be more precise? I don't understand what you mean.
> >>
> >>
> >>
> >> On 15 Jun 2007, at 17:20, Mathieu Lecarme wrote:
> >>
> >>> Your request seems to be a two-step query.
> >>> First step, you select images, and then collections.
> >>> Second step, you sort the collections.
> >>>
> >>> Can BitVector help you?
> >>>
> >>> M.
> >>> Antoine Baudoux wrote:
> >>>> Hi,
> >>>>
> >>>> I'm developing an image database. Each Lucene document
> >>>> representing an image contains (among other fields):
> >>>>
> >>>> - a date field
> >>>> - a collection field containing the ID of the collection the
> >>>> image belongs to.
> >>>>
> >>>> I want to be able to give a score to each collection.
> >>>> Collections with a higher score appear first in the results.
> >>>> I want to avoid re-indexing all the documents each time I
> >>>> change my collection scores.
> >>>>
> >>>> For example, on day 1 I decide to give collection #1 a score
> >>>> of 5 and collection #3 a score of 10 --> images belonging to
> >>>> collection #3 appear first in search results.
> >>>> On day 2 I give collection #3 a score of 2 --> images
> >>>> belonging to collection #1 appear first in search results.
> >>>>
> >>>> I have read the Lucene docs, and from what I understand
> >>>> there are many ways to achieve what I want:
> >>>>
> >>>>
> >>>> - I can use a very big Boolean query (an OR query, in fact)
> >>>> containing one TermQuery per collection ID, setting the correct
> >>>> boost factor for each TermQuery. The problem with this is that
> >>>> I have 300 collections, so I have a Boolean query with 300
> >>>> terms that I append to each query I make. I am afraid that it
> >>>> will be slow.
> >>>>
> >>>> - I can use a ValueSourceQuery, where for each document I
> >>>> compute a custom score based on the value of the collection
> >>>> field. Will it be faster than the first solution?
> >>>>
> >>>> - I can do advanced things such as writing a custom
> >>>> HitCollector, or a custom Query.
> >>>>
> >>>> - I can add another field to each document, containing a
> >>>> computed custom score, then I could sort on that field. But I
> >>>> want to avoid this solution at all costs, since it would mean
> >>>> re-indexing all the documents each time the collection scores
> >>>> change.
> >>>>
> >>>> What solution do you suggest? Is there another solution that I
> >>>> didn't mention?
> >>>>
> >>>> More recent documents should also come first: in fact the
> >>>> final sorting should be a weighted sum of the collection score
> >>>> of an image and the date of an image: most recent images from
> >>>> the best-scored collections come first, then most recent from
> >>>> lower-scored collections, then less recent from best-scored,
> >>>> and so on. I would also like to be able to adjust the balance
> >>>> between date and collection score.
> >>>>
> >>>> What solution do you suggest?
> >>>>
> >>>>
> >>>> I would also like to implement random sorting.
> >>>> My solution is: I create 12 new fields R1 -> R12 for each
> >>>> document, each containing a random number between 1 and 12.
> >>>> To get a random sort, I sort each day with a different
> >>>> combination of R1 .. R12. For example:
> >>>>
> >>>> Day 1: I sort by R1 then R4 then R5...
> >>>> Day 2: I sort by R10 then R9 then R2...
> >>>> etc...
> >>>>
> >>>> Is it a good solution? Is there another way to do it?
> >>>>
> >>>>
> >>>> Many thanks in advance for your answers.
> >>>>
> >>>> Antoine
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: java-user-help@lucene.apache.org
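
For what it's worth, the R1 .. R12 idea from the original question can be tried out without an index at all. The sketch below is plain Java with invented names (DailyRandomOrder, ImageDoc, fieldOrderForDay); it only illustrates the multi-key sort, it is not Lucene code. Seeding the field order with the day number is my own assumption, chosen so the order stays stable within a day. In Lucene the same ordering would presumably be expressed as a Sort over several SortFields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

/** Hypothetical illustration of sorting by a daily-changing combination of R1..R12. */
class DailyRandomOrder {
    static final int FIELD_COUNT = 12;

    /** Stand-in for an indexed image: r[0] plays the role of R1, r[11] of R12. */
    static final class ImageDoc {
        final String id;
        final int[] r;

        ImageDoc(String id, int[] r) {
            this.id = id;
            this.r = r;
        }
    }

    /** Values assigned once per document at "index time": each between 1 and 12. */
    static int[] randomFields(Random rng) {
        int[] r = new int[FIELD_COUNT];
        for (int i = 0; i < FIELD_COUNT; i++) {
            r[i] = 1 + rng.nextInt(12);
        }
        return r;
    }

    /** A permutation of the field indexes 0..11, derived deterministically from the day number. */
    static int[] fieldOrderForDay(long day) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < FIELD_COUNT; i++) {
            indexes.add(i);
        }
        Collections.shuffle(indexes, new Random(day));
        int[] order = new int[FIELD_COUNT];
        for (int i = 0; i < FIELD_COUNT; i++) {
            order[i] = indexes.get(i);
        }
        return order;
    }

    /** Compares documents field by field, in the given field order; later fields break ties. */
    static Comparator<ImageDoc> comparatorFor(final int[] fieldOrder) {
        return (a, b) -> {
            for (int f : fieldOrder) {
                int c = Integer.compare(a.r[f], b.r[f]);
                if (c != 0) {
                    return c;
                }
            }
            return 0;
        };
    }
}
```

Sorting a list of ImageDoc with comparatorFor(fieldOrderForDay(day)) gives one fixed ordering per day. Since each field only takes 12 distinct values, the first sort key alone leaves large groups of ties, so the later keys do most of the mixing.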