lucene-java-user mailing list archives

From Atul Bisaria <atul.bisa...@ericsson.com>
Subject RE: Increase search performance
Date Fri, 02 Feb 2018 03:45:54 GMT
Hi Adrien,

Please correct me if I am wrong, but I believe that using an extended IntComparator in a custom
Sort object for randomization would still score documents (for example, when calling
IndexSearcher.search(Query, int, Sort)).

So I tried a custom collector with IndexSearcher.search(Query, Collector), where the
custom collector does not score documents at all.

I have refactored RandomOrderCollector to fix the memory usage problem as described below.
Let me know if this looks OK now.

class RandomOrderCollector extends SimpleCollector
{
        private int maxHitsRequired;
        private int docBase;

        private ScoreDoc[] matches;

        private int numHits;

        private Random random = new Random();

        public RandomOrderCollector(int maxHitsRequired)
        {
                this.maxHitsRequired = maxHitsRequired;
                this.matches = new ScoreDoc[maxHitsRequired];
        }

        @Override
        public boolean needsScores()
        {
                return false;
        }

        @Override
        public void collect(int doc) throws IOException
        {
                int absoluteDoc = docBase + doc;
                int randomScore = random.nextInt(); // assign a random score to each doc

                if(numHits < maxHitsRequired)
                {
                        matches[numHits++] = new ScoreDoc(absoluteDoc, randomScore);
                }
                else
                {
                        int index = random.nextInt(maxHitsRequired);
                        if(matches[index].score < randomScore)
                        {
                                matches[index] = new ScoreDoc(absoluteDoc, randomScore);
                        }
                }
        }

        @Override
        protected void doSetNextReader(LeafReaderContext context) throws IOException
        {
                super.doSetNextReader(context);
                this.docBase = context.docBase;
        }

        public ScoreDoc[] getHits()
        {
                return matches;
        }
}
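
One thing I am not certain about in the collector above is whether every match ends up equally likely to be kept, since a later document only replaces a randomly chosen slot when its random score happens to be higher. For comparison, here is a minimal sketch of classic reservoir sampling (Algorithm R), which keeps memory bounded by the number of hits required and gives each match the same chance of being retained. It is plain Java with no Lucene dependency; the class name ReservoirSampler and its methods are hypothetical, and the assumption is that offer() would be called from collect() with docBase + doc.

```java
import java.util.Arrays;
import java.util.Random;

/** Sketch: uniform sample of up to k doc IDs from a stream of unknown length (Algorithm R). */
class ReservoirSampler {
    private final int[] reservoir; // holds at most k doc IDs
    private final Random random;
    private int seen;              // number of doc IDs offered so far

    ReservoirSampler(int k, Random random) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.reservoir = new int[k];
        this.random = random;
    }

    /** Call once per matching document, e.g. with docBase + doc. */
    void offer(int docId) {
        if (seen < reservoir.length) {
            // Fill phase: the first k matches are always kept.
            reservoir[seen] = docId;
        } else {
            // Replacement phase: keep the new match with probability k / (seen + 1).
            int slot = random.nextInt(seen + 1);
            if (slot < reservoir.length) {
                reservoir[slot] = docId;
            }
        }
        seen++;
    }

    /** Returns the sampled doc IDs in random order (a shuffled copy). */
    int[] sample() {
        int size = Math.min(seen, reservoir.length);
        int[] out = Arrays.copyOf(reservoir, size);
        // Fisher-Yates shuffle, so the fill-phase order does not leak into the result.
        for (int i = size - 1; i > 0; i--) {
            int j = random.nextInt(i + 1);
            int tmp = out[i];
            out[i] = out[j];
            out[j] = tmp;
        }
        return out;
    }

    public static void main(String[] args) {
        ReservoirSampler sampler = new ReservoirSampler(5, new Random());
        for (int doc = 0; doc < 1000; doc++) {
            sampler.offer(doc);
        }
        System.out.println(Arrays.toString(sampler.sample()));
    }
}
```

No per-document object is allocated here, so this would also avoid creating a ScoreDoc for every replacement.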

Best Regards,
Atul Bisaria

-----Original Message-----
From: Adrien Grand [mailto:jpountz@gmail.com]
Sent: Thursday, February 01, 2018 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Increase search performance

Yes, this collector won't perform well if you have many matches, since memory usage is linear
in the number of matches. A better option would be to extend e.g. IntComparator and implement
getNumericDocValues by returning a fake NumericDocValues instance that, for example, does a bit
mix of the doc ID and a per-request seed (HPPC's BitMixer can do that:
https://github.com/carrotsearch/hppc/blob/master/hppc/src/main/java/com/carrotsearch/hppc/BitMixer.java).
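
To make the bit-mixing idea concrete, here is a rough sketch in plain Java, without the Lucene comparator plumbing. It assumes the sort key is a mix of the doc ID XORed with a per-request seed, and it uses the well-known MurmurHash3 32-bit finalizer as the mixer; that is an assumption for illustration, not necessarily what HPPC's BitMixer does internally. The class SeededRandomOrder and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

/** Sketch: pseudo-random but repeatable ordering of doc IDs from a per-request seed. */
class SeededRandomOrder {

    /** MurmurHash3 32-bit finalizer: a bijective mixer, so distinct inputs give distinct outputs. */
    static int mix32(int k) {
        k = (k ^ (k >>> 16)) * 0x85ebca6b;
        k = (k ^ (k >>> 13)) * 0xc2b2ae35;
        return k ^ (k >>> 16);
    }

    /** The fake "doc value" a comparator could sort on. */
    static int sortKey(int docId, int seed) {
        return mix32(docId ^ seed);
    }

    /** Returns up to n doc IDs with the smallest sort keys, in ascending key order. */
    static int[] topN(int[] matchingDocs, int n, int seed) {
        // Max-heap on the sort key: the head is the worst of the hits kept so far.
        PriorityQueue<Integer> heap = new PriorityQueue<>(
                Comparator.comparingInt((Integer d) -> sortKey(d, seed)).reversed());
        for (int doc : matchingDocs) {
            heap.offer(doc);
            if (heap.size() > n) {
                heap.poll(); // evict the hit with the largest key
            }
        }
        int[] out = new int[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // largest key first, so fill from the back
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docs = new int[100];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i;
        }
        System.out.println(Arrays.toString(topN(docs, 10, 12345)));
    }
}
```

Memory stays proportional to n rather than to the number of matches, and reusing the same seed reproduces the same order, which can help if a request needs to be paged or retried.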

On Thu, Feb 1, 2018 at 12:31, Atul Bisaria <atul.bisaria@ericsson.com> wrote:

> Hi Adrien,
>
> Thanks for your reply.
>
> I have also tried testing with UsageTrackingQueryCachingPolicy, but
> did not observe a significant change in either latency or throughput.
>
> Given that I have specific search requirements of no scoring and
> sorting the search results in a random order (reason for custom sort
> object), I have also explored writing a custom collector and could
> observe quite a difference in latency figures.
>
> Let me know if this custom collector code has any loopholes which I
> could be missing:
>
> class RandomOrderCollector extends SimpleCollector {
>         private int maxHitsRequired;
>         private int docBase;
>
>         private List<Integer> matches = new ArrayList<Integer>();
>
>         public RandomOrderCollector(int maxHitsRequired)
>         {
>                 this.maxHitsRequired = maxHitsRequired;
>         }
>
>         public boolean needsScores()
>         {
>                 return false;
>         }
>
>         @Override
>         public void collect(int doc) throws IOException
>         {
>                 matches.add(docBase + doc);
>         }
>
>         @Override
>         protected void doSetNextReader(LeafReaderContext context) throws IOException
>         {
>                 super.doSetNextReader(context);
>                 this.docBase = context.docBase;
>         }
>
>         public List<Integer> getHits()
>         {
>                 Collections.shuffle(matches);
>                 maxHitsRequired = Math.min(matches.size(), maxHitsRequired);
>
>                 return matches.subList(0, maxHitsRequired);
>         }
> }
>
> Best Regards,
> Atul Bisaria
>
> -----Original Message-----
> From: Adrien Grand [mailto:jpountz@gmail.com]
> Sent: Wednesday, January 31, 2018 6:33 PM
> To: java-user@lucene.apache.org
> Subject: Re: Increase search performance
>
> Hi Atul,
>
>
> On Tue, Jan 30, 2018 at 16:24, Atul Bisaria
> <atul.bisaria@ericsson.com> wrote:
>
> > 1.     Using ConstantScoreQuery so that scoring overhead is removed since
> > scoring is not required in my search use case. I also use a custom
> > Sort object which does not sort by score (see code below).
> >
>
> If you don't sort by score, then wrapping with a ConstantScoreQuery
> won't help as Lucene will figure out scores are not needed anyway.
>
>
> > 2.     Using query cache
> >
> >
> >
> > My understanding is that query cache would cache query results and
> > hence lead to significant increase in performance. Is this
> > understanding correct?
> >
>
> It depends on what you mean by performance. If you are optimizing for
> worst-case latency, then the query cache might make things worse,
> because caching a query requires visiting all matches, while query
> execution can sometimes skip over non-interesting matches (e.g. in
> conjunctions).
>
> However, if you are looking to improve throughput, then the default
> policy of the query cache, which caches queries that look reused,
> usually helps.
>
>
> > I am using Lucene version 5.4.1 where query cache seems to be
> > enabled by default
> > (https://issues.apache.org/jira/browse/LUCENE-6784), but I am not
> > able to see any significant change in search performance.
> >
>
>
>
>
> > Here is the code I am testing with:
> >
> >
> >
> > DirectoryReader reader = DirectoryReader.open(directory); // using MMapDirectory
> >
> > // IndexReader and IndexSearcher are created only once
> > IndexSearcher searcher = new IndexSearcher(reader);
> >
> > searcher.setQueryCachingPolicy(QueryCachingPolicy.ALWAYS_CACHE);
> >
>
> Don't do that, this will always cache all filters, which usually makes
> things slower for the reason mentioned above. I would rather advise
> that you use an instance of UsageTrackingQueryCachingPolicy.
>