lucene-java-user mailing list archives

From Ivan Brusic <i...@brusic.com>
Subject Duplicate values in search
Date Mon, 28 Dec 2015 21:18:28 GMT
I just migrated a ton of code from Lucene 4.10 to 5.4: lots of custom
collectors, analyzers, queries, etc. I have migrated other code bases between
Lucene versions before (2->3, 3->4), and there was always one issue I could
not spot by eye!

When using a custom query, I get the same document twice in the result set.
The changes I made for the upgrade had to do with the query/weight API
change.

Without getting into the custom code, here is a simple test case:

@BeforeClass
public static void buildIndex() throws IOException {
    ANALYZER = new StandardAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
    DIRECTORY = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(DIRECTORY, config)) {
        // removed for brevity
        // repeated five times with different values
        Document doc = new Document();
        doc.add(...);
        writer.addDocument(doc);
    }
}

@Test
public void testQuery() throws IOException {
    try (IndexReader reader = DirectoryReader.open(DIRECTORY)) {
        IndexSearcher searcher = new IndexSearcher(reader);

        PriorityQuery query = new PriorityQuery();
        query.add(new TermQuery(new Term("foo", "xyz")));
        query.add(new TermQuery(new Term("bar", "xyz")));
        query.add(new TermQuery(new Term("baz", "xyz")));

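        // The search call was omitted from the post; the hit limit of 10
        // is an assumed value.
        TopDocs result = searcher.search(query, 10);

        CheckHits.checkDocIds("Invalid docs", new int[] {4, 2, 0, 3},
                result.scoreDocs);
    }
}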

There should be four unique results out of five, since the second
document (docId 1) does not contain the term xyz. Instead, the results
contain five hits, with the top hit (docId 4) appearing twice:

[doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
score=0.37223506 shardIndex=0, doc=3 score=0.34156355 shardIndex=0]
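
One way to confirm the duplication independently of CheckHits is to scan the returned doc IDs for repeats. The following is only a minimal sketch: it assumes the doc IDs have already been copied out of result.scoreDocs into a plain int array, and the class and method names (DuplicateDocIds, findDuplicates) are hypothetical helpers, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: reports doc IDs that appear
// more than once in a list of hits.
public class DuplicateDocIds {

    /** Returns each doc ID that occurs more than once, in order of first repeat. */
    public static List<Integer> findDuplicates(int[] docIds) {
        Set<Integer> seen = new HashSet<>();
        List<Integer> duplicates = new ArrayList<>();
        for (int docId : docIds) {
            // Set.add returns false when the ID was already present.
            if (!seen.add(docId) && !duplicates.contains(docId)) {
                duplicates.add(docId);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Doc IDs taken from the result listing above.
        int[] hits = {4, 4, 2, 0, 3};
        System.out.println(findDuplicates(hits));
    }
}
```

Run against the doc IDs from the listing above, this reports docId 4 as the only repeated hit, which matches what the failing assertion shows.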

When using a BooleanQuery, the results are correct, so obviously the
custom Query is failing somehow. In all my years of using Lucene, I have
never seen the same document returned twice. :) Without boring everyone
with the custom code, what should I be looking for? I just cannot quite
spot it.

Cheers,

Ivan
