lucene-dev mailing list archives

From Otis Gospodnetic <otis_gospodne...@yahoo.com>
Subject Re: Whither Query Norm?
Date Wed, 25 Nov 2009 05:31:49 GMT
Hello,

Regarding that monstrous term->idf map:
could it be used to adjust scores in the scenario described at http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations?
Say a map like that was created periodically for each shard and distributed to
all other nodes (so in the end each node has all maps locally).  Couldn't the local scorer
in the Solr instance (and in a distributed Lucene setup) consult the idfs for relevant terms in
all those maps and adjust the scores of local hits before returning results?
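
A minimal sketch of the merging step this would need, under stated assumptions: each shard ships its document count and per-term docFreq rather than precomputed idf values (idfs from different shards cannot be combined directly, while counts can simply be summed). The class names ShardStats and GlobalIdf are hypothetical and not part of Lucene or Solr, and the idf formula 1 + ln(N / (df + 1)) is assumed here for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-shard statistics: document count plus docFreq per term. */
final class ShardStats {
    final long numDocs;
    final Map<String, Long> docFreq;

    ShardStats(long numDocs, Map<String, Long> docFreq) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
    }
}

/** Hypothetical helper that sums shard counts and derives a corpus-wide idf. */
final class GlobalIdf {
    private final long totalDocs;
    private final Map<String, Long> totalDocFreq = new HashMap<>();

    GlobalIdf(List<ShardStats> shards) {
        long docs = 0;
        for (ShardStats shard : shards) {
            docs += shard.numDocs;
            // Sum docFreq per term across shards.
            for (Map.Entry<String, Long> e : shard.docFreq.entrySet()) {
                totalDocFreq.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        this.totalDocs = docs;
    }

    /** Assumed formula: idf = 1 + ln(N / (df + 1)). */
    double idf(String term) {
        long df = totalDocFreq.getOrDefault(term, 0L);
        return 1.0 + Math.log((double) totalDocs / (df + 1));
    }
}
```

Every node that holds the same set of shard statistics derives the same idf for a term, which is what would make scores from different shards comparable. How the local scorer then applies that value is left open here.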

Otis

>From: Jake Mannix <jake.mannix@gmail.com>
>To: java-dev@lucene.apache.org
>Sent: Fri, November 20, 2009 7:49:34 PM
>Subject: Re: Whither Query Norm?
>
>On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <markrmiller@gmail.com> wrote:
>
>>Okay - I guess that somewhat makes sense - you can calculate the
>>magnitude of the doc vectors at index time. How is that impossible with
>>incremental indexing though? Isn't it just expensive? Seems somewhat
>>expensive in the non-incremental case as well - you're just eating it at
>>index time rather than query time - though the same could be done for
>>incremental? The information is all there in either case.
>
>Ok, I think I see what you were imagining I was doing: you take the current
>state of the index as gospel for idf (when the index is already large, this
>is a good approximation), and look up these factors at index time - this
>means grabbing docFreq(Term) for each term in my document, and yes,
>this would be very expensive, I'd imagine.  I've done it by pulling a
>monstrous (the most common 1-million terms, say) Map<String, Float>
>(effectively) outside of lucene entirely, which gives term idfs, and housing
>this in memory so that computing field norms for cosine is a very fast
>operation at index time.
>
>Doing it like this is hard from scratch, but is fine incrementally, because
>I've basically fixed idf using some previous corpus (and update the idfMap
>every once in a while, in cases where it doesn't change much).  This has
>the effect of also providing a global notion of idf in a distributed corpus.
>
>  -jake
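
The index-time norm computation described in the quoted message could look roughly like the sketch below. It is not Jake's actual code: the class name CosineNorm, the fallback idf for terms missing from the map, and the use of raw term counts as tf are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: cosine length norm from a fixed in-memory idf map. */
final class CosineNorm {
    private final Map<String, Float> idfMap;
    private final float defaultIdf; // assumed fallback for terms not in the map

    CosineNorm(Map<String, Float> idfMap, float defaultIdf) {
        this.idfMap = idfMap;
        this.defaultIdf = defaultIdf;
    }

    /** Returns 1 / ||d||, where d has one component tf(t) * idf(t) per term. */
    float norm(Iterable<String> tokens) {
        // Count raw term frequencies for this field (assumption: tf = raw count).
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        // Accumulate the squared length of the tf-idf vector.
        double sumSquares = 0.0;
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            float idf = idfMap.getOrDefault(e.getKey(), defaultIdf);
            double weight = e.getValue() * idf;
            sumSquares += weight * weight;
        }
        return sumSquares == 0.0 ? 0f : (float) (1.0 / Math.sqrt(sumSquares));
    }
}
```

Each term costs one hash lookup instead of a docFreq call against the index, which is why the idf values can stay fixed between occasional refreshes of the map.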

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
