lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Strange Fuzzyquery results scoring when using a low minimal distance
Date Tue, 16 Feb 2010 09:49:42 GMT
The problem is the following:
The docFreq of the term "lucéne" is 2; all other terms have a docFreq of 1 (because StandardAnalyzer
lowercases everything, so "Lucéne" and "lucéne" are indexed as the same term). Terms with a lower
docFreq get a higher score in TermQuery. This score outweighs the boosting done by FuzzyQuery
(because your index is so small).

If you raise the minSimilarity a little bit, your query matches fewer terms and the rewritten
BooleanQuery contains fewer clauses. At some point the extra weight of the less frequent
terms is no longer relevant for the final score.
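
To make the effect concrete, here is a minimal sketch in plain Java (no Lucene dependency) that
recomputes the relative term weights for the five-document index from the original post. It is
not Lucene code, and it rests on assumptions about the Lucene 3.0 defaults: DefaultSimilarity's
idf = 1 + ln(numDocs / (docFreq + 1)), a term weight proportional to idf squared times the term
boost, and a FuzzyQuery boost of (similarity - minSimilarity) / (1 - minSimilarity) with
similarity = 1 - editDistance / min(query length, term length). Factors that are identical for
every hit in this index (queryNorm, coord, fieldNorm, tf) are left out. The class and method
names are made up for illustration.

```java
public class FuzzyScoreSketch {

    // Assumed DefaultSimilarity formula: idf = 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    // Assumed FuzzyQuery term boost: the similarity rescaled from
    // [minSimilarity, 1] to [0, 1], where
    // similarity = 1 - editDistance / min(queryLength, termLength).
    static double fuzzyBoost(int editDistance, int shorterLength, double minSimilarity) {
        double similarity = 1.0 - editDistance / (double) shorterLength;
        return (similarity - minSimilarity) / (1.0 - minSimilarity);
    }

    // Term weight up to the factors shared by all hits in this tiny index
    // (queryNorm, coord, fieldNorm, tf): idf^2 * boost.
    static double relativeWeight(int docFreq, int editDistance, int shorterLength,
                                 double minSimilarity, int numDocs) {
        double idf = idf(docFreq, numDocs);
        return idf * idf * fuzzyBoost(editDistance, shorterLength, minSimilarity);
    }

    public static void main(String[] args) {
        int numDocs = 5; // the five documents from the original post
        for (double minSim : new double[] {0.5, 0.6}) {
            System.out.println("minSimilarity = " + minSim);
            // "lucéne": exact match (0 edits), but docFreq 2 after lowercasing
            System.out.println("  lucéne: " + relativeWeight(2, 0, 6, minSim, numDocs));
            // "lucene": 1 edit away from the query term, docFreq 1
            System.out.println("  lucene: " + relativeWeight(1, 1, 6, minSim, numDocs));
        }
    }
}
```

Under these assumptions the exact match "lucéne" gets about 0.93 times the weight of "lucene" at
minSimilarity 0.5, which agrees with the ratio of the reported scores (0.95660806 / 1.0259752).
At 0.6 the order flips: "lucene" gets about 0.94 times the weight of "lucéne" (0.97959816 /
1.0438477).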

By the way, you can always look at the explain() results, which inform you about the scoring
that was done.

The fix (which applies only to trunk, see issue https://issues.apache.org/jira/browse/LUCENE-124)
is to ignore the scoring of the TermQueries generated by FuzzyQuery and to look only at the edit
distance (implemented by another MultiTermQuery.RewriteMethod), which can be set with
FuzzyQuery.setRewriteMethod().

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: stefcl [mailto:stefatwork@gmail.com]
> Sent: Tuesday, February 16, 2010 10:11 AM
> To: java-user@lucene.apache.org
> Subject: Re: Strange Fuzzyquery results scoring when using a low
> minimal distance
> 
> 
> Thanks a lot,
> But I still don't understand why raising the min similarity a little bit
> changes the ordering...
> 
> 
> 
> markharw00d wrote:
> >
> > This could be down to IDF, i.e. "Lucane" is ranked higher because it is rarer
> > despite having a worse edit distance.
> > This is arguably a bug.
> > See http://issues.apache.org/jira/browse/LUCENE-329 which discusses this.
> > You could try subclassing QueryParser and overriding newFuzzyQuery to return
> > FuzzyLikeThisQuery (found in "contrib/queries")
> >
> > Cheers
> > Mark
> >
> >
> >
> > ----- Original Message ----
> > From: stefcl <stefatwork@gmail.com>
> > To: java-user@lucene.apache.org
> > Sent: Mon, 15 February, 2010 14:13:52
> > Subject: Strange Fuzzyquery results scoring when using a low minimal
> > distance
> >
> >
> > Hello,
> >
> > I'm using Lucene v3.
> > Please consider the following spellings
> >
> > Lucene
> > Lucéne
> > lucéne
> > Lucane
> > Lucen
> >
> > When searching for "lucéne" among those words using a FuzzyQuery (with a
> > minimum similarity of 0.5), results show :
> >
> > 1. Lucene 1.0259752
> > 2. Lucane 1.0259752
> > 3. Lucéne 0.95660806
> > 4. lucéne 0.95660806
> > 5. Lucen 0.30779266
> >
> > #4 is an exact match, why does it receive a lower score than "Lucane",
> > which contains one incorrect letter?
> >
> > Also, if you raise min similarity a bit higher (0.6 or above), everything
> > becomes normal :
> >
> > 1. Lucéne 1.0438477
> > 2. lucéne 1.0438477
> > 3. Lucene 0.97959816
> > 4. Lucane 0.97959816
> >
> >
> > Any idea?
> > Thanks in advance...
> >
> >
> > The code I use :
> >
> >     /**
> >      * @param args the command line arguments
> >      */
> >     public static void main(String[] args) throws IOException, ParseException
> >     {
> >
> >         StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
> >
> >         // TODO code application logic here
> >         Directory index = new RAMDirectory();
> >         IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
> >
> >         addDoc(w, "Lucene");
> >         addDoc(w, "Lucéne");
> >         addDoc(w, "lucéne");
> >         addDoc(w, "Lucane");
> >         addDoc(w, "Lucen");
> >
> >         w.close();
> >
> >         FuzzyQuery q = new FuzzyQuery(new Term("title", "lucéne"), 0.5f);
> >
> >         // 3. search
> >         IndexSearcher searcher = new IndexSearcher(index);
> >
> >         TopDocs collector = searcher.search(q, 10);
> >         ScoreDoc[] hits = collector.scoreDocs;
> >
> >         // 4. display results
> >         System.out.println("Found " + hits.length + " hits.");
> >         for (int i = 0; i < hits.length; i++)
> >         {
> >             Document d = searcher.doc(hits[i].doc);
> >             System.out.println((i + 1) + ". " + d.get("title") + " " + hits[i].score);
> >         }
> >
> >         // searcher can only be closed when there
> >         // is no need to access the documents any more.
> >         searcher.close();
> >     }
> >
> >
> >     private static void addDoc(IndexWriter w, String value) throws IOException
> >     {
> >         Document doc = new Document();
> >         doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
> >         w.addDocument(doc);
> >     }
> > --
> > View this message in context:
> > http://old.nabble.com/Strange-Fuzzyquery-results-scoring-when-using-a-low-minimal-distance-tp27594371p27594371.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> --
> View this message in context:
> http://old.nabble.com/Strange-Fuzzyquery-results-scoring-when-using-a-low-minimal-distance-tp27594371p27605395.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.