lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Rich Heimann <heimann.rich...@gmail.com>
Subject Please help me with a basic question...
Date Wed, 18 May 2011 19:30:37 GMT
Hello all,

This is my first time on the list and my first question...forgive me it this
has been hacked out in the past.

We have set up Lucene/Solr and are getting somewhat spurious results. It
appears to be a result of heterogeneous document sizes. In other words, the
top results are sometimes (at least when the user is using typical search
terms) monopolized by a distinct type of document, which is otherwise small
(in number of terms). It appears that TF/IDF even with the cosine similarity
is sensitive to document size. I have run some tests and it in fact does
appear to be the case.

(Number of times the term appears in a document)/(Total Number of terms in
that document) * Log10(Number of total documents/Number of times search term
appears in all documents)

Are there any suggestions or best practices to deal with the intrinsic
heterogeneity in a corpus.

Thank you,
Rich

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message