lucene-dev mailing list archives

From Jake Mannix <jake.man...@gmail.com>
Subject Re: Whither Query Norm?
Date Sat, 21 Nov 2009 00:49:34 GMT
On Fri, Nov 20, 2009 at 4:20 PM, Mark Miller <markrmiller@gmail.com> wrote:

> Mark Miller wrote:
> Okay - I guess that somewhat makes sense - you can calculate the
> magnitude of the doc vectors at index time. How is that impossible with
> incremental indexing though? Isn't it just expensive? Seems somewhat
> expensive in the non-incremental case as well - you're just eating it at
> index time rather than query time - though the same could be done for
> incremental? The information is all there in either case.
>
>
Ok, I think I see what you were imagining I was doing: you take the current
state of the index as gospel for idf (when the index is already large, this
is a good approximation) and look up these factors at index time. That
means calling docFreq(Term) for each term in the document, and yes, I'd
imagine that would be very expensive. What I've done instead is pull a
monstrous Map<String, Float> (effectively; say, the 1 million most common
terms) of term idfs, built outside of Lucene entirely, and hold it in
memory, so that computing field norms for cosine is a very fast operation
at index time.
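
To make that concrete, here is a minimal sketch of the index-time
computation, not the actual code. It assumes raw term frequency as the tf
weight (Lucene's default similarity uses a square root instead) and a
made-up DEFAULT_IDF for terms missing from the map; the class and method
names are hypothetical.

```java
import java.util.Map;

class CosineNormSketch {
    // Assumption: fallback idf for terms that are not in the precomputed map.
    static final float DEFAULT_IDF = 1.0f;

    /**
     * Returns 1 / ||d||, where the document vector d has one component
     * tf(t) * idf(t) per term t. Only in-memory map lookups are needed,
     * so no docFreq(Term) calls against the index.
     */
    static float cosineNorm(Map<String, Integer> termFreqs, Map<String, Float> idfMap) {
        double sumOfSquares = 0.0;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            float idf = idfMap.getOrDefault(e.getKey(), DEFAULT_IDF);
            double weight = e.getValue() * idf; // raw tf is an assumption
            sumOfSquares += weight * weight;
        }
        // An empty document has no direction; return 0 rather than dividing by zero.
        return sumOfSquares == 0.0 ? 0.0f : (float) (1.0 / Math.sqrt(sumOfSquares));
    }
}
```

The returned value is what would be stored as the field norm, so that
dotting with the query vector at search time yields a cosine-style score.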

Doing it like this is hard from scratch, but is fine incrementally, because
I've basically fixed idf using some previous corpus, and I only update the
idfMap every once in a while, which works as long as idf doesn't change
much. This also has the effect of providing a global notion of idf in a
distributed corpus.
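
As a rough sketch of how such an idfMap might be built from a reference
corpus: the input here is a plain map of term to document frequency plus
the corpus size, both gathered however is convenient. The formula is the
one Lucene's default similarity uses for idf, log(numDocs / (docFreq + 1))
+ 1; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class IdfMapSketch {
    /**
     * Builds a term -> idf map from document frequencies counted over a
     * fixed reference corpus of numDocs documents.
     */
    static Map<String, Float> buildIdfMap(Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Float> idfMap = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            // Same shape as the default similarity's idf.
            float idf = (float) (Math.log(numDocs / (double) (e.getValue() + 1)) + 1.0);
            idfMap.put(e.getKey(), idf);
        }
        return idfMap;
    }
}
```

Rebuilding this map occasionally and shipping the same copy to every node
is what gives each shard the same idf values.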

  -jake

