lucene-java-user mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: Poor Lucene Ranking for Short Text
Date Fri, 24 Dec 2004 17:02:53 GMT
I think you are confusing lengthNorm with the overall normalization of the score.  For overall
normalization (prior to a final forced normalization in Hits), Lucene uses the formula you
cite, except that it never sums tf_d*idf_t; it uses tf_q*idf_t again instead, because the former
is computationally intractable.  Changing even a single document changes the idf values, so
either all document norms would have to be recomputed, or the sum over the document
would have to happen at query time.  The former is unacceptable at indexing time with large
indices, and the latter is unacceptable at query time with large documents.
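
For reference, the formula under discussion can be written out as follows. This is only a sketch in the notation of the quoted question, assuming the usual textbook cosine form with the squares taken inside the sums over the terms t; it is not taken from the Lucene source.

```latex
\mathrm{sim}(q,d) =
  \frac{\sum_{t} (tf_d \cdot idf_t)\,(tf_q \cdot idf_t)}
       {\underbrace{\sqrt{\sum_{t} (tf_d \cdot idf_t)^2}}_{\text{document side}}
        \;\cdot\;
        \underbrace{\sqrt{\sum_{t} (tf_q \cdot idf_t)^2}}_{\text{query side}}}
```

The factor labelled "document side" is the one that depends on idf values for every term in the document, which is why it would have to be recomputed whenever the index changes.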

lengthNorm is by default 1/sqrt(number_terms_in_document).  It is not 1.0f by default because
1.0f is in general not a good value; e.g., a single occurrence of a term in a 1 MB document
is not as significant as a single occurrence of the same term in a 1 KB document.  However,
I find the default value to need additional damping because it affects the score too much,
especially for small documents.  So, I use something like
   3.0f/log10(1000 + number_terms_in_document)
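
To make the difference between the two formulas concrete, here is a minimal, self-contained sketch that evaluates both for a few document sizes. The class and method names are hypothetical and it deliberately does not depend on Lucene. In the Lucene releases of this period the corresponding hook should be the lengthNorm(String fieldName, int numTokens) method of a DefaultSimilarity subclass, but treat that signature as an assumption and check it against your version.

```java
// Hypothetical helper class, not part of Lucene.
final class LengthNormSketch {

    // Default described above: 1/sqrt(number_terms_in_document).
    static float defaultLengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Damped variant from the message: 3.0f/log10(1000 + number_terms_in_document).
    static float dampedLengthNorm(int numTerms) {
        return (float) (3.0 / Math.log10(1000.0 + numTerms));
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000, 100000};
        for (int n : sizes) {
            // Print both norms side by side for each document length.
            System.out.printf("%7d terms: default=%.4f damped=%.4f%n",
                    n, defaultLengthNorm(n), dampedLengthNorm(n));
        }
    }
}
```

Between 10 and 100000 terms the default norm changes by a factor of 100, while the damped variant changes by less than a factor of 2, so document length has far less influence on the score.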

Chuck

  > -----Original Message-----
  > From: michael.hartmann@web.de [mailto:michael.hartmann@web.de]
  > Sent: Friday, December 24, 2004 8:24 AM
  > To: 'Lucene Users List'
  > Subject: AW: Poor Lucene Ranking for Short Text
  > 
  > Hi Kevin,
  > 
  > Seems like you have some knowledge about the lengthNorm value in Lucene.
  > Comparing it to the formula in "Modern Information Retrieval", does it
  > sum up the denominator
  > sqrt((sum(tf_d*idf_t)²)) * sqrt((sum(tf_q*idf_t)²))?
  > 
  > Just a quick note is ok.
  > 
  > Besides that, could you invite me to Rojo? Their beta status seems to be
  > quite long.
  > 
  > Thanks
  > Michael
  > 
  > | -----Original Message-----
  > | From: lucene-user-return-10902-michael.hartmann=web.de@jakarta.apache.org
  > | [mailto:lucene-user-return-10902-michael.hartmann=web.de@jakarta.apache.org]
  > | On behalf of Kevin A. Burton
  > | Sent: Wednesday, 27 October 2004 22:48
  > | To: Lucene Users List
  > | Subject: Re: Poor Lucene Ranking for Short Text
  > |
  > | Daniel Naber wrote:
  > |
  > | > (Kevin complains about shorter documents ranked higher)
  > | >
  > | >This is something that can easily be fixed. Just use a Similarity
  > | >implementation that extends DefaultSimilarity and that overrides
  > | >lengthNorm: just return 1.0f there. You need to use that Similarity for
  > | >indexing and searching, i.e. it requires reindexing.
  > | >
  > | >
  > | What happens when I do this with an existing index? I don't
  > | want to have to rewrite this index as it will take FOREVER
  > |
  > | If the current behavior is all that happens this is fine...
  > | this way I can just get this behavior for new documents that
  > | are added.
  > |
  > | Also... why isn't this the default?
  > |
  > | Kevin
  > |
  > | --
  > |
  > | Use Rojo (RSS/Atom aggregator).  Visit http://rojo.com. Ask
  > | me for an invite!  Also see irc.freenode.net #rojo if you
  > | want to chat.
  > |
  > | Rojo is Hiring! - http://www.rojonetworks.com/JobsAtRojo.html
  > |
  > | If you're interested in RSS, Weblogs, Social Networking,
  > | etc... then you should work for Rojo!  If you recommend
  > | someone and we hire them you'll get a free iPod!
  > |
  > | Kevin A. Burton, Location - San Francisco, CA
  > |        AIM/YIM - sfburtonator,  Web - http://peerfear.org/
  > | GPG fingerprint: 5FB2 F3E2 760E 70A8 6174 D393 E84D 8D04 99F1 4412
  > |
  > |
  > | ---------------------------------------------------------------------
  > | To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > | For additional commands, e-mail: lucene-user-help@jakarta.apache.org
  > |
  > |

