lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Uwe Schindler" <...@thetaphi.de>
Subject RE: Duplicate values in search
Date Thu, 31 Dec 2015 09:08:05 GMT
Hi,

from the comments around your original code I have seen that there were instructions "how
to update the code" (instructions what you need to copy from Lucene and what to change to
create Base* classes). Maybe you missed to copypaste all files. In Lucene 5 there was a major
rewrite also of Dismax/Dissum scorers.

In older Lucene versions there was an early exit condition outside of the scorer in the weight
while creating the disjunction. Before initializing the Disjunction scorer it asked every
sub scorer for the first doc - causing it to sit on first doc already. Every scorer that had
no first doc (empty one) was not added to the main disjunction scorer anymore. Because of
this the logic in the main disjunction scorer was a bit  non-natural, because the subscoreres
were all initialized - so Dis*Scorer did not call next() on the first round.

This changed in 5.x and trunk during the code rewrite. The check for empty scorers is done
in a different way now (as you said in your previous mail), so be sure to copy paste really
all classes. Because the scorer and weights are not a public classes in Lucene it is free
to change its internal behaviour.

> The condition to execute a scoreAll verses a scoreRange is more stringent
> in Lucene 5.4.  In my case, scorer.docID() is not equal to -1 at the start,
> so the code path will execute a scoreRange instead, which does not provide
> the same behavior as scoreAll.

It was always required to be -1 at start before the first call to next()/advance() - see documentation
of DocIdSetIterator/Scorer. But in the older Disjunction code never checked for this. AssertingScorer
from test framework checks all these conditions. The new logic about the pre-check for short
circuiting the disjunction creating is now different and relies on "correct" behaviour of
all (sub-)scorers.

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Ivan Brusic [mailto:ivan@brusic.com]
> Sent: Thursday, December 31, 2015 3:18 AM
> To: java-user@lucene.apache.org
> Subject: Re: Duplicate values in search
> 
> I potentially found the issue, but I am wondering why the code worked in
> the first place. Did the contract for the scorer change with Lucene 5?
> 
> The issue was that underneath, each sub scorer had a posting enum and the
> initial document was not consumed on the first pass.  Inside the
> DefaultBulkScorer, you have:
> 
> int doc = scorer.docID();
> ...
> return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);
> 
> So the first document is retrieved outside of the custom scorer. Inside the
> custom scorer base class, I had to add something like the code below to
> consume that first document:
> 
> if (firstTime) {
>     ...
>     // new code
>     for (Scorer scorer: subScorers) {
>         if (scorer.docID() == initialDoc) {
>             scorer.nextDoc();
>         }
>     }
>     ...
> }
> 
> I never wrote a custom scorer before (now that I see the power, I want to
> write my own!), so I am not sure how the existing code worked in Lucene 4.
> What I am confused is why does each subscorer need to consume their first
> document before being used:
> 
> public void add(Scorer scorer) throws IOException {
>     if (scorer.nextDoc() != NO_MORE_DOCS) { // Initialize and retain only
> if it produces docs
>         subScorers.add(scorer);
>     }
> }
> 
> The nextDoc() call advances the docBufferUpto pointer in the posting enum
> to the second document. Code does not work without call nextDoc() initially
> on each subscorer. Very confusing.
> 
> Although the existing unit test cases pass, I am still not confident about
> the code. Will write a few more test cases, but ultimately why the code
> exists in the first place and potentially replace it with base classes.
> 
> Ivan
> 
> On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <ivan@brusic.com> wrote:
> 
> > Thanks Adrien. I added the BaseScorer to the gist, but I was hoping to
> > achieve was which direction I should go into to debug this issue. I was not
> > focusing on the scorers since I did not need to upgrade them and I actually
> > do not think I ever wrote my one Scorer in Lucene. Taking the next few
> days
> > off, so I will get around to looking back into it soon.
> >
> > Ivan
> >
> > On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpountz@gmail.com>
> wrote:
> >
> >> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot to
> >> git add it?
> >>
> >> Le lun. 28 déc. 2015 à 23:07, Ivan Brusic <ivan@brusic.com> a écrit
:
> >>
> >> > Here is the complete code:
> >> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
> >> >
> >> > The code is not originally mine, so I do not take responsibility. Once
I
> >> > get things to perform correctly, I will do another pass with
> >> improvements.
> >> > Much of the custom code needs to be re-thought.
> >> >
> >> > The scorer is one class that I did not need to update, so I did not
> >> focus
> >> > on it. Will do so now.
> >> >
> >> > Ivan
> >> >
> >> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpountz@gmail.com>
> >> wrote:
> >> >
> >> > > Hi Ivan,
> >> > >
> >> > > It looks like your scorer is emitting the same document twice. Maybe
> >> you
> >> > > could try to use AssertingIndexSearcher in your test case, this is
the
> >> > kind
> >> > > of things that it should catch.
> >> > >
> >> > > The only related Lucene 5 change that I can think of is that Lucene
> >> now
> >> > > requires docs to be collected in order, did this scorer use to collect
> >> > docs
> >> > > out of order in Lucene 4?
> >> > >
> >> > > If that still doesn't help and if you can share the code of your
> >> scorer,
> >> > I
> >> > > could give it a quick look.
> >> > >
> >> > > Le lun. 28 déc. 2015 à 22:18, Ivan Brusic <ivan@brusic.com>
a écrit :
> >> > >
> >> > > > I just migrated on ton of code from Lucene 4.10 to 5.4. Lots
of
> >> custom
> >> > > > collectors, analyzers, queries, etc.. I have migrated other code
> >> bases
> >> > > from
> >> > > > Lucene before (2->3, 3->4) and I always had one issue I
could not
> >> > > eyeball!
> >> > > >
> >> > > > When using a custom query, I get the same document twice in the
> >> result
> >> > > set.
> >> > > > The changes I made for the upgrade had to do with the
> query/weight
> >> API
> >> > > > change.
> >> > > >
> >> > > > Without getting in the custom code, here is the simple test case:
> >> > > >
> >> > > > @BeforeClass
> >> > > > public static void buildIndex() throws IOException {
> >> > > >     ANALYZER = new StandardAnalyzer();
> >> > > >     IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
> >> > > >     DIRECTORY = new RAMDirectory();
> >> > > >     try (IndexWriter writer = new IndexWriter(DIRECTORY, config))
{
> >> > > >         // removed for brevity
> >> > > >         // repeated five times with different values
> >> > > >         Document doc = new Document();
> >> > > >         doc.add(...);
> >> > > >         writer.addDocument(doc);
> >> > > >     }
> >> > > > }
> >> > > >
> >> > > > @Test
> >> > > > public void testQuery() throws IOException {
> >> > > >     try (IndexReader reader = DirectoryReader.open(DIRECTORY))
{
> >> > > >         IndexSearcher searcher = new IndexSearcher(reader);
> >> > > >
> >> > > >         PriorityQuery query = new PriorityQuery();
> >> > > >         query.add(new TermQuery(new Term("foo", "xyz")));
> >> > > >         query.add(new TermQuery(new Term("bar", "xyz")));
> >> > > >         query.add(new TermQuery(new Term("baz", "xyz")));
> >> > > >
> >> > > >         CheckHits.checkDocIds("Invalid docs", new int[] {4, 2,
0,
> >> 3},
> >> > > > result.scoreDocs);
> >> > > >
> >> > > > }
> >> > > >
> >> > > > There should be four unique results out of five since the second
> >> > > > document (docId 1) does not contain the term xyz. The results
> >> instead
> >> > > > contain 5 documents, with the first one repeated twice at the
start:
> >> > > >
> >> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
> >> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
> >> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355
> shardIndex=0]
> >> > > >
> >> > > > When using a BooleanQuery, the results are correct, so obviously
> the
> >> > > > custom Query is failing somehow. In all my years of Lucene, I
never
> >> > > > had the same document twice. :) Without boring everyone with
the
> >> > > > custom code, what should I be looking for? Just cannot quite
spot
> >> it.
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Ivan
> >> > > >
> >> > >
> >> >
> >>
> >
> >


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message