lucene-java-user mailing list archives

From Michael McCandless <luc...@mikemccandless.com>
Subject Re: Scoring formula - Average number of terms in IDF
Date Fri, 18 Dec 2009 11:58:51 GMT
I'm not sure this specific detail (how IndexWriter uses Similarity) is
documented; the best "documentation" is the source code ;)

Have a look at org.apache.lucene.index.NormsWriterPerField.  That's
where the default indexing chain asks Similarity to create the norm.
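
As a rough illustration of what that norm amounts to, here is a
plain-Java sketch with no Lucene dependency.  It assumes the formula
that DefaultSimilarity uses for lengthNorm, 1/sqrt(numTerms), multiplied
by the document and field boosts.  The class and method names are made
up for the example, and the lossy encoding of the stored norm is left
out.  The point is that the value depends only on the one document
being indexed, so a corpus-wide average cannot be folded in at that
stage unless it is already known.

```java
// Plain-Java sketch; LengthNormSketch and its methods are hypothetical
// names, not part of Lucene.
final class LengthNormSketch {

    /** Assumed default formula: 1 / sqrt(number of terms in the field). */
    static float lengthNorm(int numTerms) {
        // Guard added for this sketch only; it is not Lucene behaviour.
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Per-document, per-field norm before any compact encoding. */
    static float norm(float docBoost, float fieldBoost, int numTerms) {
        return docBoost * fieldBoost * lengthNorm(numTerms);
    }
}
```

For example, a 4-term field with no boosts gets a norm of 0.5, and
nothing about the rest of the collection enters the calculation.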

Mike

On Fri, Dec 18, 2009 at 5:12 AM, kdev <v.verroios@di.uoa.gr> wrote:
>
> The avg is used only in the idf method of the Similarity class. So I guess
> there is a workaround for what I want to do. Can you give me a reference, in
> the Lucene docs, on how an IndexWriter uses the provided Similarity class?
>
> Thanks again for your time and your help.
>
>
> Michael McCandless-2 wrote:
>>
>> IndexWriter uses Similarity.lengthNorm to create a norm (boost for the
>> field, per document) based on the length of the field... it doesn't
>> invoke the other methods on Similarity.
>>
>> Are you saying you need to know the avg across the whole corpus before
>> computing that boost?
>>
>> Mike
>>
>> On Thu, Dec 17, 2009 at 10:50 AM, kdev <v.verroios@di.uoa.gr> wrote:
>>>
>>> If I follow your approach and compute the avg (outside of Lucene) while I'm
>>> building the index (for performance reasons I can't wait for all the
>>> documents to arrive before indexing them), the avg for a collection will be
>>> ready only when all the documents of the collection are indexed.
>>> Lucene states that the new Similarity class must be set with
>>> IndexWriter.setSimilarity() and be used while I build the index, and at
>>> that time the avg isn't ready yet. Is there a way to overcome this? And
>>> if the score is calculated not while the index is being created but only
>>> when searching the index, what will the consequence for performance be?
>>>
>>> (Mike, thank you for your response.)
>>>
>>>
>>> Michael McCandless-2 wrote:
>>>>
>>>> There have been some discussions, here:
>>>>
>>>>     https://issues.apache.org/jira/browse/LUCENE-2091
>>>>
>>>> about how Lucene could track avg field/doc length, but for now they
>>>> are just brainstorming-type discussions.
>>>>
>>>> You could always do something approximate outside of Lucene?  EG, make
>>>> a TokenFilter that counts how many tokens are produced for each
>>>> field/doc, aggregate & store that yourself, and use it in your
>>>> similarity impl?
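
A minimal sketch of the "aggregate & store that yourself" part, in
plain Java with no Lucene dependency.  FieldLengthStats is a
hypothetical helper, not a Lucene class.  The assumption is that a
custom TokenFilter (not shown, since its exact API depends on the
Lucene version in use) counts the tokens it passes through and calls
record() once per field per document when the stream is exhausted.
Persisting the totals across restarts is also left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: running per-field token totals. */
final class FieldLengthStats {

    private static final class Totals {
        final AtomicLong tokens = new AtomicLong();
        final AtomicLong docs = new AtomicLong();
    }

    private final Map<String, Totals> byField = new ConcurrentHashMap<>();

    /** Record that one document produced tokenCount tokens for a field. */
    void record(String field, int tokenCount) {
        Totals t = byField.computeIfAbsent(field, f -> new Totals());
        t.tokens.addAndGet(tokenCount);
        t.docs.incrementAndGet();
    }

    /**
     * Average tokens per document seen so far, or 0.0 if none.
     * The two counters are read separately, so under concurrent
     * indexing the result is only approximate.
     */
    double average(String field) {
        Totals t = byField.get(field);
        if (t == null) {
            return 0.0;
        }
        long docs = t.docs.get();
        return docs == 0 ? 0.0 : (double) t.tokens.get() / docs;
    }
}
```

A similarity implementation could then read average(field) at search
time, when more of the collection has been seen, instead of needing
the final value while the index is being built.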
>>>>
>>>> Mike
>>>>
>>>> On Tue, Dec 15, 2009 at 5:04 AM, kdev <v.verroios@di.uoa.gr> wrote:
>>>>>
>>>>> any ideas please?
>>>>> --
>>>>> View this message in context:
>>>>> http://old.nabble.com/Scoring-formula---Average-number-of-terms-in-IDF-tp26282578p26792364.html
>>>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>>>
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>>
>>>>>
>>>>
>>>
>>
>
