lucene-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-2392) Enable flexible scoring
Date Mon, 12 Apr 2010 09:49:41 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855905#action_12855905 ]

Michael McCandless commented on LUCENE-2392:
--------------------------------------------

bq. Mike, I don't think overlapTermCount should really exist in the Stats.

OK I will remove it -- I was unsure whether it was overkill :)  So
whether the posIncr 0 tokens "count" is purely an index-time
decision.

Hmm, but... we have a problem, which is that these posIncr 0 tokens
are now counted in the unique token count.  Have to mull how to avoid
that...hmm... I think doing it correctly means "count this token as
+1 on the unique tokens for this doc if it ever has non-zero posIncr"?
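
A minimal sketch of that counting rule in plain Java, outside Lucene.
The class UniqueTermCount and its parallel-array input are hypothetical
and are not part of the LUCENE-2392 patch; the sketch only illustrates
"a term counts toward the unique count if it ever appears with a
non-zero posIncr":

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not code from the patch.
final class UniqueTermCount {

    /**
     * Counts the distinct terms of one document field that occur at least
     * once with a position increment greater than zero. Terms that only
     * ever occur with posIncr == 0 are ignored.
     *
     * terms[i] and posIncrs[i] describe the i-th token of the field.
     */
    static int count(String[] terms, int[] posIncrs) {
        if (terms.length != posIncrs.length) {
            throw new IllegalArgumentException("terms and posIncrs must have the same length");
        }
        Set<String> counted = new HashSet<>();
        for (int i = 0; i < terms.length; i++) {
            if (posIncrs[i] > 0) {
                counted.add(terms[i]);
            }
        }
        return counted.size();
    }
}
```

With tokens quick(1) fast(0) fox(1), "fast" is skipped and the count is
2.  If "fast" shows up again later with posIncr 1, it is counted once.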

Really, maybe we should somehow be using an attr about the token
itself, instead of posIncr == 0?  I mean, a broken synonym injector
could conceivably provide the synonyms first (w/ the first one having
posIncr 1), followed by the real term (w/ posIncr 0)?

BTW the cost of storing the stats isn't that bad -- it increases index
size by 1.5%, on a 10M wikipedia index where each doc is 1KB of text
(~171 words per doc on avg).


> Enable flexible scoring
> -----------------------
>
>                 Key: LUCENE-2392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.1
>
>         Attachments: LUCENE-2392.patch
>
>
> This is a first step (nowhere near committable!), implementing the
> design iterated to in the recent "Baby steps towards making Lucene's
> scoring more flexible" java-dev thread.
> The idea is (if you turn it on for your Field; it's off by default) to
> store full stats in the index, into a new _X.sts file, per doc (X
> field) in the index.
> And then have FieldSimilarityProvider impls that compute doc's boost
> bytes (norms) from these stats.
> The patch is able to index the stats, merge them when segments are
> merged, and provides an iterator-only API.  It also has a starting point
> for per-field Sims that use the stats iterator API to compute boost
> bytes.  But it's not at all tied into actual searching!  There's still
> tons left to do, eg, how does one configure via Field/FieldType which
> stats one wants indexed.
> All tests pass, and I added one new TestStats unit test.
> The stats I record now are:
>   - field's boost
>   - field's unique term count (a b c a a b --> 3)
>   - field's total term count (a b c a a b --> 6)
>   - total term count per-term (sum of total term count for all docs
>     that have this term)
> Still need at least the total term count for each field.
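
To make the stats listed above concrete, here is a small sketch in
plain Java.  FieldStatsSketch and its method names are hypothetical
(they are not the patch's API), and the per-term sum follows one
literal reading of "sum of total term count for all docs that have
this term":

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the recorded stats; not code from the patch.
final class FieldStatsSketch {

    /** Unique term count of one doc's field: a b c a a b --> 3. */
    static int uniqueTermCount(List<String> tokens) {
        return new HashSet<>(tokens).size();
    }

    /** Total term count of one doc's field: a b c a a b --> 6. */
    static int totalTermCount(List<String> tokens) {
        return tokens.size();
    }

    /**
     * For each term, sums the total term count of every doc that
     * contains the term at least once.
     */
    static Map<String, Long> totalTermCountPerTerm(List<List<String>> docs) {
        Map<String, Long> sums = new HashMap<>();
        for (List<String> doc : docs) {
            // Visit each distinct term once so a doc's length is added once per term.
            for (String term : new HashSet<>(doc)) {
                sums.merge(term, (long) doc.size(), Long::sum);
            }
        }
        return sums;
    }
}
```

For two docs "a b c a a b" and "a d", the per-term sums under this
reading are a=8, b=6, c=6, d=2.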


        
