lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ivan Brusic <i...@brusic.com>
Subject Re: Duplicate values in search
Date Mon, 04 Jan 2016 16:37:57 GMT
Thanks Uwe,

Reminds me of upgrading to Lucene 4 on another project where the
TokenStream contract was now being fully enforced. Lots of fixing on that
one!

As troublesome is this upgrade is, all the changes do make sense. Great
work on Lucene 5. The scorer code does have "upgrade" instructions, but I
was not aware of it since that code compiled without needing any API
changes. From what I can tell, that code actually differs a lot from
wherever it was copied from in the first place.

Cheers,

Ivan

On Thu, Dec 31, 2015 at 1:08 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> Hi,
>
> from the comments around your original code I have seen that there were
> instructions "how to update the code" (instructions what you need to copy
> from Lucene and what to change to create Base* classes). Maybe you missed
> to copypaste all files. In Lucene 5 there was a major rewrite also of
> Dismax/Dissum scorers.
>
> In older Lucene versions there was an early exit condition outside of the
> scorer in the weight while creating the disjunction. Before initializing
> the Disjunction scorer it asked every sub scorer for the first doc -
> causing it to sit on first doc already. Every scorer that had no first doc
> (empty one) was not added to the main disjunction scorer anymore. Because
> of this the logic in the main disjunction scorer was a bit  non-natural,
> because the subscoreres were all initialized - so Dis*Scorer did not call
> next() on the first round.
>
> This changed in 5.x and trunk during the code rewrite. The check for empty
> scorers is done in a different way now (as you said in your previous mail),
> so be sure to copy paste really all classes. Because the scorer and weights
> are not a public classes in Lucene it is free to change its internal
> behaviour.
>
> > The condition to execute a scoreAll verses a scoreRange is more stringent
> > in Lucene 5.4.  In my case, scorer.docID() is not equal to -1 at the
> start,
> > so the code path will execute a scoreRange instead, which does not
> provide
> > the same behavior as scoreAll.
>
> It was always required to be -1 at start before the first call to
> next()/advance() - see documentation of DocIdSetIterator/Scorer. But in the
> older Disjunction code never checked for this. AssertingScorer from test
> framework checks all these conditions. The new logic about the pre-check
> for short circuiting the disjunction creating is now different and relies
> on "correct" behaviour of all (sub-)scorers.
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
> > -----Original Message-----
> > From: Ivan Brusic [mailto:ivan@brusic.com]
> > Sent: Thursday, December 31, 2015 3:18 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Duplicate values in search
> >
> > I potentially found the issue, but I am wondering why the code worked in
> > the first place. Did the contract for the scorer change with Lucene 5?
> >
> > The issue was that underneath, each sub scorer had a posting enum and the
> > initial document was not consumed on the first pass.  Inside the
> > DefaultBulkScorer, you have:
> >
> > int doc = scorer.docID();
> > ...
> > return scoreRange(collector, scorer, twoPhase, acceptDocs, doc, max);
> >
> > So the first document is retrieved outside of the custom scorer. Inside
> the
> > custom scorer base class, I had to add something like the code below to
> > consume that first document:
> >
> > if (firstTime) {
> >     ...
> >     // new code
> >     for (Scorer scorer: subScorers) {
> >         if (scorer.docID() == initialDoc) {
> >             scorer.nextDoc();
> >         }
> >     }
> >     ...
> > }
> >
> > I never wrote a custom scorer before (now that I see the power, I want to
> > write my own!), so I am not sure how the existing code worked in Lucene
> 4.
> > What I am confused is why does each subscorer need to consume their first
> > document before being used:
> >
> > public void add(Scorer scorer) throws IOException {
> >     if (scorer.nextDoc() != NO_MORE_DOCS) { // Initialize and retain only
> > if it produces docs
> >         subScorers.add(scorer);
> >     }
> > }
> >
> > The nextDoc() call advances the docBufferUpto pointer in the posting enum
> > to the second document. Code does not work without call nextDoc()
> initially
> > on each subscorer. Very confusing.
> >
> > Although the existing unit test cases pass, I am still not confident
> about
> > the code. Will write a few more test cases, but ultimately why the code
> > exists in the first place and potentially replace it with base classes.
> >
> > Ivan
> >
> > On Tue, Dec 29, 2015 at 7:01 AM, Ivan Brusic <ivan@brusic.com> wrote:
> >
> > > Thanks Adrien. I added the BaseScorer to the gist, but I was hoping to
> > > achieve was which direction I should go into to debug this issue. I
> was not
> > > focusing on the scorers since I did not need to upgrade them and I
> actually
> > > do not think I ever wrote my one Scorer in Lucene. Taking the next few
> > days
> > > off, so I will get around to looking back into it soon.
> > >
> > > Ivan
> > >
> > > On Mon, Dec 28, 2015 at 5:41 PM, Adrien Grand <jpountz@gmail.com>
> > wrote:
> > >
> > >> Ivan, I can't find the BaseScorer class in the gist. Maybe you forgot
> to
> > >> git add it?
> > >>
> > >> Le lun. 28 déc. 2015 à 23:07, Ivan Brusic <ivan@brusic.com> a écrit
:
> > >>
> > >> > Here is the complete code:
> > >> > https://gist.github.com/brusic/e3018a2e403f5707fa3e
> > >> >
> > >> > The code is not originally mine, so I do not take responsibility.
> Once I
> > >> > get things to perform correctly, I will do another pass with
> > >> improvements.
> > >> > Much of the custom code needs to be re-thought.
> > >> >
> > >> > The scorer is one class that I did not need to update, so I did not
> > >> focus
> > >> > on it. Will do so now.
> > >> >
> > >> > Ivan
> > >> >
> > >> > On Mon, Dec 28, 2015 at 4:58 PM, Adrien Grand <jpountz@gmail.com>
> > >> wrote:
> > >> >
> > >> > > Hi Ivan,
> > >> > >
> > >> > > It looks like your scorer is emitting the same document twice.
> Maybe
> > >> you
> > >> > > could try to use AssertingIndexSearcher in your test case, this
> is the
> > >> > kind
> > >> > > of things that it should catch.
> > >> > >
> > >> > > The only related Lucene 5 change that I can think of is that
> Lucene
> > >> now
> > >> > > requires docs to be collected in order, did this scorer use to
> collect
> > >> > docs
> > >> > > out of order in Lucene 4?
> > >> > >
> > >> > > If that still doesn't help and if you can share the code of your
> > >> scorer,
> > >> > I
> > >> > > could give it a quick look.
> > >> > >
> > >> > > Le lun. 28 déc. 2015 à 22:18, Ivan Brusic <ivan@brusic.com>
a
> écrit :
> > >> > >
> > >> > > > I just migrated on ton of code from Lucene 4.10 to 5.4.
Lots of
> > >> custom
> > >> > > > collectors, analyzers, queries, etc.. I have migrated other
code
> > >> bases
> > >> > > from
> > >> > > > Lucene before (2->3, 3->4) and I always had one issue
I could
> not
> > >> > > eyeball!
> > >> > > >
> > >> > > > When using a custom query, I get the same document twice
in the
> > >> result
> > >> > > set.
> > >> > > > The changes I made for the upgrade had to do with the
> > query/weight
> > >> API
> > >> > > > change.
> > >> > > >
> > >> > > > Without getting in the custom code, here is the simple test
> case:
> > >> > > >
> > >> > > > @BeforeClass
> > >> > > > public static void buildIndex() throws IOException {
> > >> > > >     ANALYZER = new StandardAnalyzer();
> > >> > > >     IndexWriterConfig config = new IndexWriterConfig(ANALYZER);
> > >> > > >     DIRECTORY = new RAMDirectory();
> > >> > > >     try (IndexWriter writer = new IndexWriter(DIRECTORY,
> config)) {
> > >> > > >         // removed for brevity
> > >> > > >         // repeated five times with different values
> > >> > > >         Document doc = new Document();
> > >> > > >         doc.add(...);
> > >> > > >         writer.addDocument(doc);
> > >> > > >     }
> > >> > > > }
> > >> > > >
> > >> > > > @Test
> > >> > > > public void testQuery() throws IOException {
> > >> > > >     try (IndexReader reader = DirectoryReader.open(DIRECTORY))
{
> > >> > > >         IndexSearcher searcher = new IndexSearcher(reader);
> > >> > > >
> > >> > > >         PriorityQuery query = new PriorityQuery();
> > >> > > >         query.add(new TermQuery(new Term("foo", "xyz")));
> > >> > > >         query.add(new TermQuery(new Term("bar", "xyz")));
> > >> > > >         query.add(new TermQuery(new Term("baz", "xyz")));
> > >> > > >
> > >> > > >         CheckHits.checkDocIds("Invalid docs", new int[]
{4, 2,
> 0,
> > >> 3},
> > >> > > > result.scoreDocs);
> > >> > > >
> > >> > > > }
> > >> > > >
> > >> > > > There should be four unique results out of five since the
second
> > >> > > > document (docId 1) does not contain the term xyz. The results
> > >> instead
> > >> > > > contain 5 documents, with the first one repeated twice at
the
> start:
> > >> > > >
> > >> > > > [doc=4 score=1.1976817 shardIndex=0, doc=4 score=1.1976817
> > >> > > > shardIndex=0, doc=2 score=0.63170385 shardIndex=0, doc=0
> > >> > > > score=0.37223506 shardIndex=0, doc=3 score=0.34156355
> > shardIndex=0]
> > >> > > >
> > >> > > > When using a BooleanQuery, the results are correct, so obviously
> > the
> > >> > > > custom Query is failing somehow. In all my years of Lucene,
I
> never
> > >> > > > had the same document twice. :) Without boring everyone
with the
> > >> > > > custom code, what should I be looking for? Just cannot quite
> spot
> > >> it.
> > >> > > >
> > >> > > > Cheers,
> > >> > > >
> > >> > > > Ivan
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message