lucene-java-user mailing list archives

From Ivan Brusic <i...@brusic.com>
Subject Re: Duplicate values in search
Date Thu, 31 Dec 2015 02:17:54 GMT
I potentially found the issue, but I am wondering why the code worked in
the first place. Did the contract for the scorer change with Lucene 5?

The issue was that, underneath, each sub scorer has a postings enum, and its
initial document was not consumed on the first pass. Inside
DefaultBulkScorer, you have:

int doc = scorer.docID();
...
return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);

So the first document is retrieved outside of the custom scorer. Inside the
custom scorer base class, I had to add something like the code below to
consume that first document:

if (firstTime) {
    ...
    // new code
    for (Scorer scorer: subScorers) {
        if (scorer.docID() == initialDoc) {
            scorer.nextDoc();
        }
    }
    ...
}
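
To make the failure mode concrete, here is a minimal sketch in plain Java with no Lucene classes involved. PostingsStub, UnionScorer, the consumeInitialDoc flag, and collect() are all hypothetical stand-ins: they model a postings-backed sub scorer, a custom scorer whose sub scorers are already positioned on their first document, and a caller that reads docID() before it ever calls nextDoc(). This is an assumption about how the custom code behaves, not a copy of it.

```java
import java.util.ArrayList;
import java.util.List;

public class FirstDocDemo {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Hypothetical stand-in for a postings-backed sub scorer. */
    static final class PostingsStub {
        private final int[] docs;
        private int pos = -1; // -1 means "not positioned yet"
        PostingsStub(int... docs) { this.docs = docs; }
        int docID() { return pos < 0 ? -1 : (pos < docs.length ? docs[pos] : NO_MORE_DOCS); }
        int nextDoc() { if (pos < docs.length) pos++; return docID(); }
    }

    /** Hypothetical union scorer whose sub scorers are primed when added. */
    static final class UnionScorer {
        private final List<PostingsStub> subs = new ArrayList<>();
        private final boolean consumeInitialDoc;
        private boolean firstTime = true;
        private int doc;

        UnionScorer(boolean consumeInitialDoc, PostingsStub... iterators) {
            this.consumeInitialDoc = consumeInitialDoc;
            for (PostingsStub it : iterators) {
                if (it.nextDoc() != NO_MORE_DOCS) subs.add(it); // prime, keep non-empty
            }
            doc = minDoc(); // the scorer is already positioned on its first match
        }

        int docID() { return doc; }

        int nextDoc() {
            // On the first call, the buggy variant leaves the sub scorers where
            // they are, so the initial document is produced a second time.
            if (!firstTime || consumeInitialDoc) {
                for (PostingsStub it : subs) {
                    if (it.docID() == doc) it.nextDoc();
                }
            }
            firstTime = false;
            doc = minDoc();
            return doc;
        }

        private int minDoc() {
            int min = NO_MORE_DOCS;
            for (PostingsStub it : subs) min = Math.min(min, it.docID());
            return min;
        }
    }

    /** Caller that collects the current document before calling nextDoc(). */
    static List<Integer> collect(UnionScorer scorer) {
        List<Integer> hits = new ArrayList<>();
        for (int doc = scorer.docID(); doc != NO_MORE_DOCS; doc = scorer.nextDoc()) {
            hits.add(doc);
        }
        return hits;
    }

    static List<Integer> run(boolean consumeInitialDoc) {
        return collect(new UnionScorer(consumeInitialDoc,
                new PostingsStub(0, 2, 4), new PostingsStub(2, 3)));
    }

    public static void main(String[] args) {
        System.out.println(run(false)); // [0, 0, 2, 3, 4]
        System.out.println(run(true));  // [0, 2, 3, 4]
    }
}
```

With consumeInitialDoc set to false the first document shows up twice, which is the same symptom as in the failing test; setting it to true corresponds to the extra loop added inside the firstTime branch.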

I have never written a custom scorer before (now that I see the power, I want
to write my own!), so I am not sure how the existing code worked in Lucene 4.
What confuses me is why each subscorer needs to consume its first
document before being used:

public void add(Scorer scorer) throws IOException {
    // Initialize and retain only if it produces docs
    if (scorer.nextDoc() != NO_MORE_DOCS) {
        subScorers.add(scorer);
    }
}

The nextDoc() call advances the docBufferUpto pointer in the postings enum
to the second document. The code does not work without initially calling
nextDoc() on each subscorer. Very confusing.

Although the existing unit test cases pass, I am still not confident about
the code. I will write a few more test cases, but ultimately I want to
understand why the code exists in the first place and potentially replace
it with base classes.
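
For those extra test cases, a small duplicate check over the returned doc IDs would catch this class of bug directly. The sketch below is plain Java; DuplicateDocCheck and duplicates() are hypothetical names, and in a real test the int array would presumably be filled from the doc field of each ScoreDoc.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class DuplicateDocCheck {
    /** Returns the doc IDs that occur more than once, in order of first repeat. */
    static Set<Integer> duplicates(int[] docIds) {
        Set<Integer> seen = new HashSet<>();
        Set<Integer> dups = new LinkedHashSet<>();
        for (int doc : docIds) {
            if (!seen.add(doc)) { // add() returns false when doc was already seen
                dups.add(doc);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        // Doc IDs from the failing result set reported earlier in the thread.
        System.out.println(duplicates(new int[] {4, 4, 2, 0, 3})); // [4]
        // Doc IDs the test expects.
        System.out.println(duplicates(new int[] {4, 2, 0, 3}));    // []
    }
}
```

A test can then assert that duplicates(...) is empty before comparing the hits against the expected order.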

Ivan

On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <ivan@brusic.com> wrote:

> Thanks Adrien. I added the BaseScorer to the gist, but what I was hoping to
> find out was which direction I should take to debug this issue. I was not
> focusing on the scorers since I did not need to upgrade them, and I actually
> do not think I ever wrote my own Scorer in Lucene. I am taking the next few
> days off, so I will get around to looking back into it soon.
>
> Ivan
>
> On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpountz@gmail.com> wrote:
>
>> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot to
>> git add it?
>>
>> On Mon, Dec 28, 2015 at 23:07, Ivan Brusic <ivan@brusic.com> wrote:
>>
>> > Here is the complete code:
>> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
>> >
>> > The code is not originally mine, so I do not take responsibility. Once I
>> > get things to perform correctly, I will do another pass with improvements.
>> > Much of the custom code needs to be re-thought.
>> >
>> > The scorer is one class that I did not need to update, so I did not focus
>> > on it. Will do so now.
>> >
>> > Ivan
>> >
>> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpountz@gmail.com> wrote:
>> >
>> > > Hi Ivan,
>> > >
>> > > It looks like your scorer is emitting the same document twice. Maybe you
>> > > could try to use AssertingIndexSearcher in your test case; this is the
>> > > kind of thing that it should catch.
>> > >
>> > > The only related Lucene 5 change that I can think of is that Lucene now
>> > > requires docs to be collected in order. Did this scorer use to collect
>> > > docs out of order in Lucene 4?
>> > >
>> > > If that still doesn't help and if you can share the code of your scorer,
>> > > I could give it a quick look.
>> > >
>> > > On Mon, Dec 28, 2015 at 22:18, Ivan Brusic <ivan@brusic.com> wrote:
>> > >
>> > > > I just migrated a ton of code from Lucene 4.10 to 5.4. Lots of custom
>> > > > collectors, analyzers, queries, etc. I have migrated other code bases
>> > > > between Lucene versions before (2->3, 3->4) and I always had one issue
>> > > > I could not eyeball!
>> > > >
>> > > > When using a custom query, I get the same document twice in the result
>> > > > set. The changes I made for the upgrade had to do with the query/weight
>> > > > API change.
>> > > >
>> > > > Without getting into the custom code, here is the simple test case:
>> > > >
>> > > > @BeforeClass
>> > > > public static void buildIndex() throws IOException {
>> > > >     ANALYZER = new StandardAnalyzer();
>> > > >     IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
>> > > >     DIRECTORY = new RAMDirectory();
>> > > >     try (IndexWriter writer = new IndexWriter(DIRECTORY, config)) {
>> > > >         // removed for brevity
>> > > >         // repeated five times with different values
>> > > >         Document doc = new Document();
>> > > >         doc.add(...);
>> > > >         writer.addDocument(doc);
>> > > >     }
>> > > > }
>> > > >
>> > > > @Test
>> > > > public void testQuery() throws IOException {
>> > > >     try (IndexReader reader = DirectoryReader.open(DIRECTORY)) {
>> > > >         IndexSearcher searcher = new IndexSearcher(reader);
>> > > >
>> > > >         PriorityQuery query = new PriorityQuery();
>> > > >         query.add(new TermQuery(new Term("foo", "xyz")));
>> > > >         query.add(new TermQuery(new Term("bar", "xyz")));
>> > > >         query.add(new TermQuery(new Term("baz", "xyz")));
>> > > >
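>> > > >         // The search call is not shown in the original post; assumed
>> > > >         // to be something like the following.
>> > > >         TopDocs result = searcher.search(query, 10);
>> > > >         CheckHits.checkDocIds("Invalid docs", new int[] {4, 2, 0, 3},
>> > > >                 result.scoreDocs);
>> > > >     }
>> > > > }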
>> > > >
>> > > > There should be four unique results out of five since the second
>> > > > document (docId 1) does not contain the term xyz. The results instead
>> > > > contain 5 documents, with the first one appearing twice at the start:
>> > > >
>> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
>> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
>> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355 shardIndex=0]
>> > > >
>> > > > When using a BooleanQuery, the results are correct, so obviously the
>> > > > custom Query is failing somehow. In all my years of Lucene, I never
>> > > > had the same document twice. :) Without boring everyone with the
>> > > > custom code, what should I be looking for? Just cannot quite spot it.
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Ivan
>> > > >
>> > >
>> >
>>
>
>
