lucene-java-user mailing list archives

From Paul Libbrecht <p...@hoplahup.net>
Subject Re: Please help me with a basic question...
Date Wed, 18 May 2011 19:34:57 GMT
Richard,

In Solr, at least, there is an analyzer that avoids duplicates.
I think that would solve it.
There is also an option somewhere to ignore IDF (in the Similarity? in solrconfig?).

paul


On 18 May 2011 at 21:30, Rich Heimann wrote:

> Hello all,
> 
> This is my first time on the list and my first question... forgive me if this
> has been hashed out in the past.
> 
> We have set up Lucene/Solr and are getting somewhat spurious results. It
> appears to be a result of heterogeneous document sizes. In other words, the
> top results are sometimes (at least when the user is using typical search
> terms) monopolized by a distinct type of document, which is otherwise small
> (in number of terms). It appears that TF/IDF even with the cosine similarity
> is sensitive to document size. I have run some tests and it in fact does
> appear to be the case.
> 
> (Number of times the term appears in a document)/(Total Number of terms in
> that document) * Log10(Number of total documents/Number of times search term
> appears in all documents)
> 
> Are there any suggestions or best practices to deal with the intrinsic
> heterogeneity in a corpus?
> 
> Thank you,
> Rich
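
The formula quoted above can be worked through numerically to see why short documents rise to the top. The sketch below is only an illustration of that formula as written in the message, not of Lucene's or Solr's actual scoring code; the function name `tf_idf` and all corpus figures are hypothetical values chosen for the example.

```python
import math


def tf_idf(term_count, doc_length, total_docs, term_occurrences):
    """Score as written in the message:
    (term count / document length) * log10(total docs / term occurrences).
    """
    tf = term_count / doc_length
    idf = math.log10(total_docs / term_occurrences)
    return tf * idf


# Hypothetical corpus figures, for illustration only.
TOTAL_DOCS = 1000
TERM_OCCURRENCES = 10

# A short document that mentions the term once in 10 terms.
short_doc = tf_idf(1, 10, TOTAL_DOCS, TERM_OCCURRENCES)

# A long document that mentions the term five times in 1000 terms.
long_doc = tf_idf(5, 1000, TOTAL_DOCS, TERM_OCCURRENCES)

print(f"short document score: {short_doc:.3f}")
print(f"long document score:  {long_doc:.3f}")
```

Under these assumed numbers the short document scores higher even though the long one contains the term five times as often, because the term count is divided directly by the document length. This is the size sensitivity described in the message.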

