From: John Wang
To: dev@lucene.apache.org
Date: Tue, 8 Jun 2010 08:41:51 -0700
Subject: Re: Proposal: Scorer api change

Hi Shai:

Similarity alone is often not sufficient for scoring. For example, to implement age decay for documents (very useful for corpora such as news or tweets), you want to project the raw tf-idf score onto a time curve, say f(x). To do this, you would write a custom scorer that decorates the underlying scorer from, say, your boolean query:

public float score(){
    return myFunc(innerScorer.score());
}

This is fine, but then you would also have to do this:

public int nextDoc(){
    return innerScorer.nextDoc();
}

and also:

public int advance(int target){
    return innerScorer.advance(target);
}

The difference here is that nextDoc and advance are called far more often than score, so you are introducing an extra method call on each of them, which is not insignificant for queries that produce large recall sets.
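To make the forwarding overhead concrete, here is a minimal, self-contained sketch of such a decorator. It does not use Lucene's classes: SimpleScorer, ArrayScorer, AgeDecayScorer, the ageInDays array and the half-life decay curve are all hypothetical stand-ins chosen for illustration, and the IOException that Lucene's real iterator methods declare is left out for brevity.

```java
// Hypothetical stand-ins for illustration only; these are NOT Lucene's classes.
class AgeDecayDemo {
    public static void main(String[] args) {
        SimpleScorer base = new ArrayScorer(new int[] {0, 2, 5}, new float[] {1.0f, 2.0f, 4.0f});
        float[] ageInDays = {0f, 0f, 10f, 0f, 0f, 20f}; // indexed by doc id (assumed data)
        AgeDecayScorer scorer = new AgeDecayScorer(base, ageInDays, 10f);
        int doc;
        while ((doc = scorer.nextDoc()) != SimpleScorer.NO_MORE_DOCS) {
            System.out.println("doc=" + doc + " score=" + scorer.score());
        }
        System.out.println("forwarded iteration calls: " + scorer.forwardedCalls);
    }
}

// Simplified model of the current contract: the scorer is also the iterator.
abstract class SimpleScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;
    abstract int docID();
    abstract int nextDoc();
    abstract int advance(int target);
    abstract float score(); // scores the current doc only
}

// Toy scorer over a sorted array of matching doc ids and their raw scores.
final class ArrayScorer extends SimpleScorer {
    private final int[] docs;
    private final float[] scores;
    private int pos = -1;

    ArrayScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    int nextDoc() {
        pos++;
        return docID();
    }

    int advance(int target) {
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }

    float score() {
        return scores[pos];
    }
}

// The decorator: only score() adds behavior, yet every iteration call
// still has to be forwarded through this wrapper.
final class AgeDecayScorer extends SimpleScorer {
    private final SimpleScorer inner;
    private final float[] ageInDays;
    private final float halfLifeDays;
    int forwardedCalls = 0; // counts pure pass-through calls

    AgeDecayScorer(SimpleScorer inner, float[] ageInDays, float halfLifeDays) {
        this.inner = inner;
        this.ageInDays = ageInDays;
        this.halfLifeDays = halfLifeDays;
    }

    int docID() {
        return inner.docID();
    }

    int nextDoc() {
        forwardedCalls++;
        return inner.nextDoc();
    }

    int advance(int target) {
        forwardedCalls++;
        return inner.advance(target);
    }

    float score() {
        // Assumed decay curve: the raw score halves every halfLifeDays.
        float age = ageInDays[inner.docID()];
        return inner.score() * (float) Math.pow(0.5, age / halfLifeDays);
    }
}
```

The forwardedCalls counter is only there to show how many calls pass through the wrapper without doing any work of their own.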

Hope this makes sense.

Thanks

-John

On Tue, Jun 8, 2010 at 5:02 AM, Shai Erera <serera@gmail.com> wrote:

> I'm not sure I understand what you mean - Scorer is a DISI itself, and the
> scoring formula is mostly controlled by Similarity.
>
> What will be the benefits of the proposed change?
>
> Shai
>
>
> On Tue, Jun 8, 2010 at 8:25 AM, John Wang <john.wang@gmail.com> wrote:
>
>> Hi guys:
>>
>>     I'd like to make a proposal to change the Scorer class/api to the
>> following:
>>
>>
>> public abstract class Scorer{
>>    DocIdSetIterator getDocIDSetIterator();
>>    float score(int docid);
>> }
>>
>> Reasons:
>>
>> 1) To build a Scorer from an existing Scorer (e.g. that produces raw
>> scores from tfidf), one would decorate it, and it would introduce overhead
>> (in function calls) around nextDoc and advance, even if you just want to
>> augment the score method which is called much fewer times.
>>
>> 2) The current contract forces scoring on the currentDoc in the underlying
>> iterator. So once you pass "current", you can no longer score. In one of our
>> use-cases, it is very inconvenient.
>>
>> What do you think? I can go ahead and open an issue and work on a patch if
>> I get some agreement.
>>
>> Thanks
>>
>> -John
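
For comparison, the proposed shape (an iterator getter plus score(int docid)) could look roughly like the sketch below. This is only an illustration under assumptions, not a patch: DocIdIterator, ProposedScorer, ArrayBackedScorer, FunctionScorer and ScoreFunction are hypothetical names that do not exist in Lucene, and IOException is again omitted. The decorator overrides only score and hands back the inner iterator untouched, so iteration goes through no wrapper, and a doc can be scored after the iterator has moved past it.

```java
import java.util.Arrays;

// Hypothetical sketch of the proposed API shape; none of these types are Lucene's.
class ProposedApiDemo {
    public static void main(String[] args) {
        float[] ageInDays = {0f, 0f, 10f, 0f, 0f, 20f}; // indexed by doc id (assumed data)
        ProposedScorer base = new ArrayBackedScorer(new int[] {0, 2, 5}, new float[] {1f, 2f, 4f});
        ProposedScorer decayed = new FunctionScorer(base,
                (docid, raw) -> raw * (float) Math.pow(0.5, ageInDays[docid] / 10f));

        DocIdIterator it = decayed.getDocIDSetIterator();
        int doc;
        while ((doc = it.nextDoc()) != DocIdIterator.NO_MORE_DOCS) {
            System.out.println("doc=" + doc + " score=" + decayed.score(doc));
        }
        // Scoring is no longer tied to the iterator's current position.
        System.out.println("late score for doc 0: " + decayed.score(0));
    }
}

interface DocIdIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int docID();
    int nextDoc();
    int advance(int target);
}

// Proposed split: iteration and scoring are separate concerns.
abstract class ProposedScorer {
    abstract DocIdIterator getDocIDSetIterator();
    abstract float score(int docid);
}

// Hypothetical hook playing the role of myFunc from the message.
interface ScoreFunction {
    float apply(int docid, float rawScore);
}

final class ArrayDocIdIterator implements DocIdIterator {
    private final int[] docs;
    private int pos = -1;

    ArrayDocIdIterator(int[] docs) {
        this.docs = docs;
    }

    public int docID() {
        if (pos < 0) return -1;
        return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
    }

    public int nextDoc() {
        pos++;
        return docID();
    }

    public int advance(int target) {
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }
}

// Toy scorer over a sorted array of matching doc ids and their raw scores.
final class ArrayBackedScorer extends ProposedScorer {
    private final int[] docs;
    private final float[] scores;

    ArrayBackedScorer(int[] docs, float[] scores) {
        this.docs = docs;
        this.scores = scores;
    }

    DocIdIterator getDocIDSetIterator() {
        return new ArrayDocIdIterator(docs);
    }

    float score(int docid) {
        int i = Arrays.binarySearch(docs, docid);
        if (i < 0) throw new IllegalArgumentException("doc " + docid + " does not match");
        return scores[i];
    }
}

// Decorator that changes scoring only; iteration is not wrapped at all.
final class FunctionScorer extends ProposedScorer {
    private final ProposedScorer inner;
    private final ScoreFunction func;

    FunctionScorer(ProposedScorer inner, ScoreFunction func) {
        this.inner = inner;
        this.func = func;
    }

    DocIdIterator getDocIDSetIterator() {
        return inner.getDocIDSetIterator(); // no forwarding layer
    }

    float score(int docid) {
        return func.apply(docid, inner.score(docid));
    }
}
```

One trade-off this sketch makes visible: score(int docid) needs some way to find the score for an arbitrary doc (a binary search here), which a real implementation may not get as cheaply.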

