lucene-java-user mailing list archives

From "Chuck Williams" <ch...@manawiz.com>
Subject RE: Relevance and ranking ...
Date Sat, 18 Dec 2004 17:43:11 GMT
The coord is the fraction of clauses matched in a BooleanQuery, so with
your example of a 5-word BooleanQuery, the coord factors should be .4,
.8, .8, 1.0 respectively for doc1, doc2, doc3 and doc4.
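
As a quick check of those numbers, here is a minimal sketch of the arithmetic. The class and method names are made up for illustration; the formula (clauses matched divided by total clauses) is the one described above.

```java
// Sketch: coord = (query clauses matched) / (total query clauses).
// CoordSketch is a hypothetical name, not a Lucene class.
final class CoordSketch {
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        int maxOverlap = 5;            // ibm, risc, tape, drive, manual
        int[] matched = {2, 4, 4, 5};  // doc1..doc4 from the example in this thread
        for (int i = 0; i < matched.length; i++) {
            System.out.println("doc" + (i + 1) + " coord=" + coord(matched[i], maxOverlap));
        }
    }
}
```

If your own coord() prints the same value for every document, it is worth checking that the query really is a BooleanQuery with five optional clauses.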

One big issue you've got here is lengthNorm.  Doc2 is 1/10 the size of
doc4, so its lengthNorm is over 3x larger (a factor of sqrt(10), about
3.16).  This more than makes up for the difference in coord (.8 vs. 1.0).
In your original post you indicated a desire for a linear lengthNorm,
which would actually make this problem much worse (a factor of 10 instead
of sqrt(10)).  You probably need to tone down the lengthNorm instead (I
turn mine off entirely, at least so far, by fixing it at 1.0; this is not
good in general, but got me past similar problems until I can find a good
formula).  You might try an inverse-log lengthNorm with a high base (like
the formula for idf I posted earlier).
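
To make the lengthNorm effect concrete, here is a rough sketch that multiplies coord by a length norm for the four example documents. The 1/sqrt(numTerms) formula is DefaultSimilarity's default. The inverse-log formula 1/(1 + log_base(numTerms)) and the base of 1000 are assumptions picked for illustration only, not the formula from the earlier post, and the sketch ignores tf, idf and the lossy encoding of stored norms.

```java
// Sketch only: compares the default 1/sqrt(numTerms) length norm with a
// hypothetical inverse-log norm. LengthNormSketch is not a Lucene class.
final class LengthNormSketch {
    // DefaultSimilarity's default formula: 1 / sqrt(numTerms).
    static double sqrtNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumed, flatter alternative: 1 / (1 + log_base(numTerms)).
    static double inverseLogNorm(int numTerms, double base) {
        return 1.0 / (1.0 + Math.log(numTerms) / Math.log(base));
    }

    public static void main(String[] args) {
        int[] matched = {2, 4, 4, 5};          // query terms matched by doc1..doc4
        int[] numTerms = {100, 30, 100, 300};  // document lengths from the example
        double base = 1000.0;                  // assumed "high base"
        for (int i = 0; i < matched.length; i++) {
            double coord = matched[i] / 5.0;
            System.out.printf("doc%d coord*sqrtNorm=%.4f coord*inverseLogNorm=%.4f%n",
                    i + 1,
                    coord * sqrtNorm(numTerms[i]),
                    coord * inverseLogNorm(numTerms[i], base));
        }
    }
}
```

With the sqrt norm, doc2's product comes out well above doc4's, which matches the ordering being reported. With the assumed inverse-log norm, doc4 edges ahead of doc2, but only by a narrow margin, so the base would need tuning against real explain() output.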

The other thing that can bite you is the tf and idf computations.  E.g.,
if manual is a more common term than the others (and so has a lower
idf), matching it adds little to doc4's score, and the tf*idf scores on
doc2 could more than compensate for the difference in coord, even if you
set lengthNorm to be 1.0.
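
A small sketch of how much idf can differ between a rare and a common term. The tf and idf formulas are DefaultSimilarity's defaults (sqrt(freq) and ln(numDocs/(docFreq+1)) + 1); the index size and document frequencies are invented for illustration, and IdfSketch is a hypothetical class name.

```java
// Sketch only: default tf and idf formulas applied to made-up statistics.
final class IdfSketch {
    // DefaultSimilarity's idf: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // DefaultSimilarity's tf: sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    public static void main(String[] args) {
        int numDocs = 1000;       // hypothetical index size
        int rareDocFreq = 10;     // hypothetical: term appears in 10 documents
        int commonDocFreq = 500;  // hypothetical: term appears in 500 documents
        System.out.println("rare term   tf*idf=" + tf(1) * idf(rareDocFreq, numDocs));
        System.out.println("common term tf*idf=" + tf(1) * idf(commonDocFreq, numDocs));
    }
}
```

Since idf enters the score through both the query weight and the field weight, the gap between a rare and a common term in the final score is larger than the raw idf values suggest.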

What is happening will be apparent from the explanations.  If you print
these out and post them, I'd be happy to suggest specific formulas.
Just use code like this:

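          // Needs org.apache.lucene.document.Document,
          // org.apache.lucene.search.Hits and
          // org.apache.lucene.search.IndexSearcher.
          // "directory" and "query" are your existing index and parsed query.
          IndexSearcher searcher = new IndexSearcher(directory);
          System.out.println(query);
          Hits hits = searcher.search(query);
          for (int i = 0; i < hits.length(); i++) {
              Document doc = hits.doc(i);
              System.out.print(hits.score(i));
              // Use whatever your fields are here:
              System.out.print("  title:");
              System.out.print(doc.get("title"));
              System.out.print(" description:");
              System.out.println(doc.get("description"));
              // End of fields
              // Per-hit score breakdown (tf, idf, norms, coord):
              System.out.println(searcher.explain(query, hits.id(i)));
              System.out.println("--------------------------");
          }
          searcher.close();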

Chuck

  > -----Original Message-----
  > From: Gururaja H [mailto:guru_hr29@yahoo.com]
  > Sent: Saturday, December 18, 2004 4:56 AM
  > To: Lucene Users List
  > Subject: Re: Relevance and ranking ...
  > 
  > Hi Erik,
  > 
  > Created my own subclass of Similarity.  When I printed the values for
  > the coord() factor, I got the same value for all 4 documents, so the
  > score is NOT getting boosted.
  > I want this boost because I want a document that has, e.g., all three
  > terms in a three-word query to rank over those that contain just two
  > of the words.
  > 
  > Please let me know how to go about doing this.  Please explain the
  > coordination factor.
  > 
  > The default order of documents that I get for my example given in this
  > thread is as follows:
  > Doc#2
  > Doc#4
  > Doc#3
  > Doc#1
  > 
  > Any input would be helpful.  Thanks,
  > 
  > Gururaja
  > 
  > Erik Hatcher <erik@ehatchersolutions.com> wrote:
  > 
  > On Dec 17, 2004, at 6:09 AM, Gururaja H wrote:
  > > Thanks for the reply. Is there any sample code which shows how to
  > > change the coord() factor, overlapping, length normalization, etc.?
  > > If there is any, please provide it.
  > 
  > Have a look at Lucene's DefaultSimilarity code itself. Use that as a
  > starting point - in fact you should subclass it and only override the
  > one or two methods you want to tweak.
  > 
  > There are probably some other examples in Lucene's test cases, or that
  > have been posted to the list, but I don't have handy pointers to them.
  > 
  > Erik
  > 
  > 
  > >
  > > Thanks,
  > > Gururaja
  > >
  > >
  > > Erik Hatcher wrote:
  > > The coord() factor of Similarity is what controls a multiplier factor
  > > for overlapping query terms in a document. The DefaultSimilarity
  > > already contains factors that allow documents with overlapping terms
  > > to get boosted. Is this not working for you? You may also need to
  > > adjust length normalization factors. Check the javadocs on Similarity
  > > for details on implementing your own formulas. Also become familiar
  > > with IndexSearcher.explain() and the Explanation so that you can see
  > > how adjusting things affects the details.
  > >
  > > Erik
  > >
  > > On Dec 17, 2004, at 3:42 AM, Gururaja H wrote:
  > >
  > >> Hi,
  > >>
  > >> How can I implement the following? Please provide input.
  > >>
  > >>
  > >> For example, if the search query has 5 terms (ibm, risc, tape, drive,
  > >> manual) and there are 4 matching documents with the following
  > >> attributes, then the order should be as described below.
  > >>
  > >> Doc#1: contains terms (ibm, drive) and has a total of 100 terms in
  > >> the document.
  > >>
  > >> Doc#2: contains terms (ibm, risc, tape, drive) and has a total of 30
  > >> terms in the document.
  > >>
  > >> Doc#3: contains terms (ibm, risc, tape, drive) and has a total of 100
  > >> terms in the document.
  > >>
  > >> Doc#4: contains terms (ibm, risc, tape, drive, manual) and has a
  > >> total of 300 terms in the document.
  > >>
  > >> The search results should include all four documents since each has
  > >> one or more of the search terms; however, they should be returned in
  > >> this order:
  > >>
  > >> Doc#4
  > >>
  > >> Doc#2
  > >>
  > >> Doc#3
  > >>
  > >> Doc#1
  > >>
  > >> Doc#4 should be first, since of the 5 search terms, it contains all 5.
  > >>
  > >> Doc#2 should be second, since it has 4 of the 5 search terms and its
  > >> ratio of matched terms to total terms in the document (4/30) is higher
  > >> than Doc#3's. Doc#3 also has 4 of the 5 terms, but its ratio is 4/100.
  > >>
  > >> Doc#1 is last since it only has 2 of the 5 terms.
  > >>
  > >>
  > >> ----
  > >>
  > >> Thanks,
  > >> Gururaja
  > >>
  > >>
  > >> __________________________________________________
  > >> Do You Yahoo!?
  > >> Tired of spam? Yahoo! Mail has the best spam protection around
  > >> http://mail.yahoo.com
  > >
  > >
  > > ---------------------------------------------------------------------
  > > To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
  > > For additional commands, e-mail: lucene-user-help@jakarta.apache.org
  > >
  > >
  > >
  > 
  > 


