lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doron Cohen <cdor...@gmail.com>
Subject Re: Please help me with a basic question...
Date Thu, 19 May 2011 20:38:59 GMT
Hi Rich,
If I understand correctly you are concerned that short documents are
preferred too much over long ones, is this really the case?
It would help to understand what goes on to look at the Explanation of the
score for say two result documents - one that you think is ranked too low,
and one that is ranked too high...
If you are convinced that length normalization is the culprit you could give
a try to:
- omitting norms all together at indexing
- using e.g. SeetSpotSimilarity which do not favor shorter documents.
Regards,
Doron

On Thu, May 19, 2011 at 5:20 PM, Rich Heimann <heimann.richard@gmail.com>wrote:

> Thanks Paul,
>
> I do not know what duplicates are in this case and it is the denominator of
> the TF that bothers me more than the numerator of the TF (if that is in
> fact
> what you are suggesting).
>
> What have been the effects of ignoring the IDF? When is it appropriate. It
> would seem that by doing so rare terms have less (no) weight. Thoughts?
>
> Thanks again,
> Rich
>
>
> On Wed, May 18, 2011 at 3:34 PM, Paul Libbrecht <paul@hoplahup.net> wrote:
>
> > Richard,
> >
> > in SOLR at least there's an analyzer that avoids duplicates.
> > I think that would solve it.
> > There's also somewhere the option to ignore IDF (in similarity? in
> > solrconfig?).
> >
> > paul
> >
> >
> > Le 18 mai 2011 à 21:30, Rich Heimann a écrit :
> >
> > > Hello all,
> > >
> > > This is my first time on the list and my first question...forgive me it
> > this
> > > has been hacked out in the past.
> > >
> > > We have set up Lucene/Solr and are getting somewhat spurious results.
> It
> > > appears to be a result of heterogeneous document sizes. In other words,
> > the
> > > top results are sometimes (at least when the user is using typical
> search
> > > terms) monopolized by a distinct type of document, which is otherwise
> > small
> > > (in number of terms). It appears that TF/IDF even with the cosine
> > similarity
> > > seems to be sensitive to document size. I have run some tests and it in
> > fact does
> > > appear to be the case.
> > >
> > > (Number of times the term appears in a document)/(Total Number of terms
> > in
> > > that document) * Log10(Number of total documents/Number of times search
> > term
> > > appears in all documents)
> > >
> > > Are there any suggestions or best practices to deal with the intrinsic
> > > heterogeneity in a corpus.
> > >
> > > Thank you,
> > > Rich
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> >
> >
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message