lucene-dev mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: [jira] Commented: (LUCENE-2392) Enable flexible scoring
Date Mon, 12 Apr 2010 08:37:46 GMT
Robert, I'm not sure where I proposed to shove random statistics into the
index. Lucene calculates a doc length today, and some in academia/research
here disagree with how it's done. So instead of attempting to fix it for
all, I think it'd be great if one could define the doc length as one
perceives it. Why is that problematic?
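
To make the idea concrete, here is a minimal sketch of what such a plug-in
point could look like. DocLengthProvider is only the name floated in the JIRA
comment quoted below; the interface, its docLength method and both
implementations are assumptions for illustration, not an existing Lucene API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

public class DocLengthSketch {

    /** Hypothetical plug-in point; NOT an actual Lucene interface. */
    public interface DocLengthProvider {
        /** Returns the "length" of one field of one document, given its terms. */
        int docLength(List<String> fieldTerms);
    }

    /** Stand-in for a default: every term occurrence counts towards the length. */
    public static final DocLengthProvider TOTAL_TERMS = terms -> terms.size();

    /** One possible custom definition: only distinct terms count. */
    public static final DocLengthProvider UNIQUE_TERMS =
            terms -> new HashSet<>(terms).size();

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c", "a", "a", "b");
        System.out.println(TOTAL_TERMS.docLength(terms));  // prints 6
        System.out.println(UNIQUE_TERMS.docLength(terms)); // prints 3
    }
}
```

The point of the sketch is only that the length definition becomes swappable;
where the indexer would call it is left open.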

What Mike opened is an issue titled "enable flexible scoring" ... doesn't
what I'm asking for fall under that umbrella?

Also, maybe we should have that discussion on the issue?

Shai

On Mon, Apr 12, 2010 at 11:31 AM, Robert Muir <rcmuir@gmail.com> wrote:

> I disagree. I think what Mike has defined here is way beyond a baby-step:
> it's all the stats needed to support modern IR models in Lucene: BM25,
> additional vector space algorithms, divergence from randomness, and language
> modelling.
>
> I think the ability to calculate your own random statistics and shove them
> into the index (this would be messy like how to get access to the aggregates
> you need anyway) is something different entirely, best left to research
> systems.
>
> You can't even do that with Terrier now.
>
> On Mon, Apr 12, 2010 at 3:35 AM, Shai Erera (JIRA) <jira@apache.org> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12855875#action_12855875]
>>
>> Shai Erera commented on LUCENE-2392:
>> ------------------------------------
>>
>> Mike - it'll also be great if we can store the length of the document in a
>> custom way. I think what I'm saying is that if we can open up the norms
>> computation to custom code - that will do what I want, right? Maybe we can
>> have a class like DocLengthProvider which apps can plug in if they want to
>> customize how that length is computed. Wherever we write the norms, we'll
>> call that impl, which by default will do what Lucene does today?
>> I think though that it's not a field-level setting, but an IW one?
>>
>> > Enable flexible scoring
>> > -----------------------
>> >
>> >                 Key: LUCENE-2392
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
>> >             Project: Lucene - Java
>> >          Issue Type: Improvement
>> >          Components: Search
>> >            Reporter: Michael McCandless
>> >            Assignee: Michael McCandless
>> >             Fix For: 3.1
>> >
>> >         Attachments: LUCENE-2392.patch
>> >
>> >
>> > This is a first step (nowhere near committable!), implementing the
>> > design iterated to in the recent "Baby steps towards making Lucene's
>> > scoring more flexible" java-dev thread.
>> > The idea is (if you turn it on for your Field; it's off by default) to
>> > store full stats in the index, into a new _X.sts file, per doc (X
>> > field) in the index.
>> > And then have FieldSimilarityProvider impls that compute doc's boost
>> > bytes (norms) from these stats.
>> > The patch is able to index the stats, merge them when segments are
>> > merged, and provides an iterator-only API.  It also has a starting point
>> > for per-field Sims that use the stats iterator API to compute boost
>> > bytes.  But it's not at all tied into actual searching!  There's still
>> > tons left to do, eg, how does one configure via Field/FieldType which
>> > stats one wants indexed.
>> > All tests pass, and I added one new TestStats unit test.
>> > The stats I record now are:
>> >   - field's boost
>> >   - field's unique term count (a b c a a b --> 3)
>> >   - field's total term count (a b c a a b --> 6)
>> >   - total term count per-term (sum of total term count for all docs
>> >     that have this term)
>> > Still need at least the total term count for each field.
>>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
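
For reference, the per-field stats listed in the quoted issue description can
be illustrated with a small stand-alone sketch. The class and method names
here are made up for illustration and have nothing to do with the attached
patch or its _X.sts file format; the per-term aggregate follows the wording
"sum of total term count for all docs that have this term".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FieldStatsSketch {

    /** Total term count of a field: a b c a a b --> 6. */
    public static int totalTermCount(List<String> terms) {
        return terms.size();
    }

    /** Unique term count of a field: a b c a a b --> 3. */
    public static int uniqueTermCount(List<String> terms) {
        return new HashSet<>(terms).size();
    }

    /** For each term, the sum of the total term counts of all docs containing it. */
    public static Map<String, Long> totalTermCountPerTerm(List<List<String>> docs) {
        Map<String, Long> sums = new TreeMap<>();
        for (List<String> doc : docs) {
            long length = totalTermCount(doc);
            // Each doc contributes its length once per distinct term it contains.
            for (String term : new HashSet<>(doc)) {
                sums.merge(term, length, Long::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<String> doc1 = Arrays.asList("a b c a a b".split(" "));
        List<String> doc2 = Arrays.asList("a d".split(" "));
        System.out.println(uniqueTermCount(doc1)); // prints 3
        System.out.println(totalTermCount(doc1));  // prints 6
        System.out.println(totalTermCountPerTerm(Arrays.asList(doc1, doc2)));
    }
}
```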
