lucene-dev mailing list archives

From "Joel Barry (JIRA)" <j...@apache.org>
Subject [jira] [Created] (LUCENE-4559) PerFieldSimilarityWrapper
Date Wed, 14 Nov 2012 19:10:13 GMT
Joel Barry created LUCENE-4559:
----------------------------------

             Summary: PerFieldSimilarityWrapper
                 Key: LUCENE-4559
                 URL: https://issues.apache.org/jira/browse/LUCENE-4559
             Project: Lucene - Core
          Issue Type: Improvement
    Affects Versions: 4.0
            Reporter: Joel Barry
            Priority: Minor


This issue requests that the documentation be clarified regarding the
current behavior of queryNorm() and coord() on PerFieldSimilarityWrapper,
and that support be added for the use case described below.

The documentation for PerFieldSimilarityWrapper (Lucene 4.0) says:

{noformat}
  Subclasses should implement get(String) to return an appropriate
  Similarity (for example, using field-specific parameter values) for
  the field.
{noformat}

This is misleading because of the behavior of queryNorm() and
coord().  The Similarity returned from get() is not consulted for these
methods.  Instead, the PerFieldSimilarityWrapper subclass's own methods
are called.  I understand that this is because these methods apply to
the query as a whole rather than per field.  However, consider the
following.  A PerFieldSimilarityWrapper with no per-field behavior (one
that just returns DefaultSimilarity from get()) behaves differently from
DefaultSimilarity itself:

{noformat}
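import static org.junit.Assert.assertEquals;

import java.io.IOException;

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.DefaultSimilarity;
import org.apache.lucene.search.similarities.PerFieldSimilarityWrapper;
import org.apache.lucene.search.similarities.Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

// Wrapper with no per-field behavior: every field gets DefaultSimilarity.
class MyPerFieldSimilarity1 extends PerFieldSimilarityWrapper {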
    @Override
    public Similarity get(String name) {
        return new DefaultSimilarity();
    }
}

public class PerFieldSimilarityWrapperTest {    
    private float runQuery(Similarity similarity) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, new WhitespaceAnalyzer(Version.LUCENE_40));
        config.setSimilarity(similarity);
        Directory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, config);
        Document doc = new Document();
        doc.add(new TextField("A-field", "first", Store.YES));
        writer.addDocument(doc);
        writer.commit();
        
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(similarity);
        TermQuery query = new TermQuery(new Term("A-field", "first"));
        TopDocs topDocs = searcher.search(query, 1);
        return topDocs.scoreDocs[0].score;
    }
    
    @Test
    public void testSimple() throws Exception {
        float score1 = runQuery(new DefaultSimilarity());
        float score2 = runQuery(new MyPerFieldSimilarity1());
        assertEquals(score1, score2, 0.0001);
        // The assertion fails, showing that the two similarities score differently:
        // java.lang.AssertionError:
        //   expected:<0.3068528175354004> but was:<0.09415864944458008>
    }
}
{noformat}
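
The two scores in the assertion message above are consistent with
queryNorm() evaluating to a constant 1 in the wrapper case.  The
following is a minimal plain-Java sketch of that arithmetic, not Lucene
code.  It assumes the classic DefaultSimilarity formulas (idf = 1 +
ln(numDocs / (docFreq + 1)), tf = sqrt(freq), queryNorm = 1 /
sqrt(sumOfSquaredWeights)), a field norm of 1 for the single-term
field, and that the wrapper inherits a queryNorm() that returns 1.  The
class and method names are made up for illustration.

```java
// Plain-Java sketch with no Lucene dependency. The formulas mirror what
// DefaultSimilarity is assumed to compute in Lucene 4.0.
final class QueryNormSketch {

    // Classic idf: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // DefaultSimilarity-style normalization: 1 / sqrt(sumOfSquaredWeights).
    static float defaultQueryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Single-term TF-IDF score with a query boost of 1:
    // (idf * queryNorm) * (sqrt(freq) * idf * fieldNorm).
    static float score(float freq, float idf, float queryNorm, float fieldNorm) {
        float queryWeight = idf * queryNorm;                            // query side
        float fieldWeight = (float) Math.sqrt(freq) * idf * fieldNorm;  // document side
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // One document in the index, and the query term occurs in it once.
        float idf = idf(1, 1);

        // DefaultSimilarity: queryNorm cancels one factor of idf.
        float withDefault = score(1f, idf, defaultQueryNorm(idf * idf), 1f);

        // Wrapper case under the assumption above: queryNorm is 1.
        float withWrapper = score(1f, idf, 1f, 1f);

        System.out.println("DefaultSimilarity-style score: " + withDefault);
        System.out.println("queryNorm == 1 score:          " + withWrapper);
    }
}
```

Under these assumptions the first score is approximately idf (about
0.307) and the second is approximately idf squared (about 0.094), which
matches the expected and actual values reported by the failing test.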

One solution is to override and forward, e.g.

{noformat}
class MyPerFieldSimilarity1 extends PerFieldSimilarityWrapper {
    @Override
    public Similarity get(String name) {
        return new DefaultSimilarity();
    }
    @Override
    public float coord(int overlap, int maxOverlap) {
        return get("dummy").coord(overlap, maxOverlap);
    }
    @Override
    public float queryNorm(float valueForNormalization) {
        return get("dummy").queryNorm(valueForNormalization);
    }
}
{noformat}

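However, coord() and queryNorm() are not passed a field name and have
no access to the fields used by the query, so the forwarding code has
to call get() with a placeholder ("dummy") argument.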

Suppose an application arranges documents so that there are two
distinct field groupings:

{noformat}
Document:
  A-field1
  A-field2
  A-field3
  B-field1
  B-field2
  B-field3
{noformat}

The application creates queries that use the A fields, or the B
fields, but never both A and B in the same query.  Then it seems
reasonable that PerFieldSimilarityWrapper should provide a way for
queryNorm() and coord() to operate on these sets of fields.  This
cannot be done with the current implementation.
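
To make the request concrete, here is a rough plain-Java model of the
desired behavior.  It is not a Lucene API and not a proposed patch: the
QueryLevelFactors interface, the FieldGroupFactors class, and the
prefix-based grouping are all hypothetical names and choices made for
illustration.  The idea is only that, given the fields a query uses,
the query-level factors could be resolved per field group.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical: the query-level factors that a per-group similarity would supply.
interface QueryLevelFactors {
    float coord(int overlap, int maxOverlap);
    float queryNorm(float sumOfSquaredWeights);
}

// Hypothetical group using the classic TF-IDF style factors.
final class ClassicFactors implements QueryLevelFactors {
    public float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }
    public float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }
}

// Hypothetical group that leaves scores unnormalized.
final class NeutralFactors implements QueryLevelFactors {
    public float coord(int overlap, int maxOverlap) { return 1f; }
    public float queryNorm(float sumOfSquaredWeights) { return 1f; }
}

// Maps field-name prefixes (e.g. "A-", "B-") to query-level factors.
final class FieldGroupFactors {
    private final Map<String, QueryLevelFactors> byPrefix = new LinkedHashMap<>();

    void register(String prefix, QueryLevelFactors factors) {
        byPrefix.put(prefix, factors);
    }

    // Resolves the one group that all query fields belong to.
    // Queries that mix groups are rejected, matching the stated use case.
    QueryLevelFactors forQueryFields(Collection<String> fields) {
        QueryLevelFactors resolved = null;
        for (String field : fields) {
            QueryLevelFactors match = lookup(field);
            if (resolved != null && resolved != match) {
                throw new IllegalArgumentException("Query mixes field groups: " + fields);
            }
            resolved = match;
        }
        if (resolved == null) {
            throw new IllegalArgumentException("Query uses no fields");
        }
        return resolved;
    }

    private QueryLevelFactors lookup(String field) {
        for (Map.Entry<String, QueryLevelFactors> entry : byPrefix.entrySet()) {
            if (field.startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        throw new IllegalArgumentException("No group registered for field: " + field);
    }
}
```

With "A-" registered to ClassicFactors and "B-" to NeutralFactors, a
query over A-field1 and A-field2 would resolve to the A group's coord()
and queryNorm(), while a query over B-field3 would resolve to the B
group's.  How the set of query fields would reach the similarity is
exactly the part the current API does not provide.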



