From: "Michael McCandless (JIRA)"
To: java-dev@lucene.apache.org
Date: Mon, 12 Apr 2010 05:57:41 -0400 (EDT)
Subject: [jira] Commented: (LUCENE-2392) Enable flexible scoring

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855906#action_12855906 ]

Michael McCandless commented on LUCENE-2392:
--------------------------------------------

bq. I think what I'm saying is that if we can open up the norms computation to custom code - that will do what I want, right?

I'm calling the norms "boost bytes" :)  This was Marvin's term; I like it.

This patch makes boost byte computation completely private to the sim
(see the *FieldSimProvider).  Ie the sim providers walk the stats and
do whatever they want to "prepare" for real searching.  EG if you have
the RAM, maybe you want to use a true float[] instead of boost bytes.
Or if you really don't have the RAM, maybe you use only 4 bits per doc,
not 8.  The FieldSim just provides a "float boost(int docID)", so what
it does under the hood is private.

bq. Maybe we can have a class like DocLengthProvider which apps can plug in if they want to customize how that length is computed.

So... I'm actually trying to avoid extensibility on the first go, here
(this is the "baby steps" part of the original thread).

Ie, the IR world seems to have converged on a smallish set of "stats"
that are commonly required, so I'd like to make those initial stats
work well, for starters.  Commit that (it enables all sorts of state of
the art scoring models), and perhaps cut over to the default Robert
created in LUCENE-2187 (which needs stats to work correctly).  And then
(phase 2) work out pluggability so you can put your own stats in....

bq. Wherever we write the norms, we'll call that impl, which by default will do what Lucene does today?

Right, this is the DefaultSimProvider in my current patch -- it simply
computes the same thing Lucene does today, but (once it's hooked up) it
does so from the stats at IR open time, instead of during indexing.

bq. I think though that it's not a field-level setting, but an IW one?

It's field level now and I think we should keep it that way.  EG
Terrier was apparently document oriented in the past but has now
deprecated that and moved to per-field.  You can always make a
catch-all field if you "really" want aggregated stats across the
entire doc?

> Enable flexible scoring
> -----------------------
>
>                 Key: LUCENE-2392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>         Attachments: LUCENE-2392.patch
>
>
> This is a first step (nowhere near committable!), implementing the
> design iterated to in the recent "Baby steps towards making Lucene's
> scoring more flexible" java-dev thread.
> The idea is (if you turn it on for your Field; it's off by default) to
> store full stats in the index, into a new _X.sts file, per doc (X
> field) in the index.
> And then have FieldSimilarityProvider impls that compute doc's boost
> bytes (norms) from these stats.
> The patch is able to index the stats, merge them when segments are
> merged, and provides an iterator-only API.  It also has a starting
> point for per-field Sims that use the stats iterator API to compute
> boost bytes.  But it's not at all tied into actual searching!  There's
> still tons left to do, eg, how does one configure via Field/FieldType
> which stats one wants indexed.
> All tests pass, and I added one new TestStats unit test.
> The stats I record now are:
>   - field's boost
>   - field's unique term count (a b c a a b --> 3)
>   - field's total term count (a b c a a b --> 6)
>   - total term count per-term (sum of total term count for all docs
>     that have this term)
> Still need at least the total term count for each field.
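
To make the discussion above concrete, here is a minimal Java sketch of
per-field stats feeding a sim that exposes only "float boost(int docID)".
It is not code from LUCENE-2392.patch: FieldStats, FieldSim, floatSim,
nibbleSim and rawBoost are hypothetical names chosen for illustration, and
only the boost(int docID) signature comes from the comment. The formula
boost / sqrt(total term count) is an assumption standing in for the classic
length norm; a real provider could compute anything it likes from the stats.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-document stats for one field (illustrative, not the patch's API). */
final class FieldStats {
    final float fieldBoost;
    final int uniqueTermCount;
    final int totalTermCount;

    FieldStats(float fieldBoost, int uniqueTermCount, int totalTermCount) {
        this.fieldBoost = fieldBoost;
        this.uniqueTermCount = uniqueTermCount;
        this.totalTermCount = totalTermCount;
    }

    /** Builds stats from an already-tokenized field value. */
    static FieldStats of(float fieldBoost, String... terms) {
        Set<String> unique = new HashSet<>();
        for (String term : terms) {
            unique.add(term);
        }
        return new FieldStats(fieldBoost, unique.size(), terms.length);
    }
}

/** All that search code sees; the storage behind it stays private. */
interface FieldSim {
    float boost(int docID);
}

final class FlexibleScoringSketch {
    /** Assumed formula: field boost divided by sqrt(total term count). */
    static float rawBoost(FieldStats s) {
        return s.totalTermCount == 0 ? 0f : s.fieldBoost / (float) Math.sqrt(s.totalTermCount);
    }

    /** Plenty of RAM: keep one full float per document. */
    static FieldSim floatSim(FieldStats[] statsByDoc) {
        final float[] boosts = new float[statsByDoc.length];
        for (int doc = 0; doc < statsByDoc.length; doc++) {
            boosts[doc] = rawBoost(statsByDoc[doc]);
        }
        return docID -> boosts[docID];
    }

    /** Little RAM: clamp to [0, 1], quantize to 4 bits, pack two documents per byte. */
    static FieldSim nibbleSim(FieldStats[] statsByDoc) {
        final byte[] packed = new byte[(statsByDoc.length + 1) / 2];
        for (int doc = 0; doc < statsByDoc.length; doc++) {
            float clamped = Math.max(0f, Math.min(1f, rawBoost(statsByDoc[doc])));
            int level = Math.round(clamped * 15f);   // 0..15
            int shift = (doc & 1) * 4;               // even docs: low nibble, odd docs: high nibble
            packed[doc >> 1] |= level << shift;
        }
        return docID -> ((packed[docID >> 1] >> ((docID & 1) * 4)) & 0x0F) / 15f;
    }

    public static void main(String[] args) {
        FieldStats s = FieldStats.of(1.0f, "a", "b", "c", "a", "a", "b");
        // For "a b c a a b" the unique count is 3 and the total count is 6.
        System.out.println(s.uniqueTermCount + " unique, " + s.totalTermCount + " total");
    }
}
```

Both factories return the same FieldSim interface, so a caller cannot tell
whether a float[] or packed 4-bit values sit underneath. That is the sense
in which the boost computation is private to the sim.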