lucene-dev mailing list archives

From "Adrien Grand (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-8788) Order LeafReaderContexts by Estimated Number Of Hits
Date Tue, 21 May 2019 12:44:00 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-8788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16844796#comment-16844796 ]

Adrien Grand commented on LUCENE-8788:
--------------------------------------

Do I understand correctly that your plan is to select multiple slices, but to collect them
sequentially rather than in parallel, so that collection of a slice can leverage information
that was gathered in previous slices? For instance, say a user wants the top 10 hits sorted
by increasing value of a numeric field foo, and the 10th best hit has a value of 7 for foo
after collecting the first slice. Then we could ignore documents whose value for foo is
greater than 7 in follow-up slices. And since Lucene has no expectation regarding the order
in which slices are collected, we can visit slices in whatever order suits us best, for
example sorted by increasing minimum (or maximum, or median) foo value.
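
To make the idea concrete, here is a minimal sketch in plain Java. It is not Lucene code:
the Slice and Result classes and the collectTopK method are hypothetical names invented for
this illustration, and a slice is reduced to an array of foo values. It assumes an ascending
sort and, as a simplification, treats ties with the current k-th best value as
non-competitive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative stand-in only; none of these names are Lucene APIs. */
final class SequentialSliceSketch {

    /** Hypothetical slice: just the foo values of its documents. */
    static final class Slice {
        final long[] fooValues;
        final long minFoo;

        Slice(long... fooValues) {
            this.fooValues = fooValues;
            this.minFoo = Arrays.stream(fooValues).min().orElse(Long.MAX_VALUE);
        }
    }

    /** Top-k foo values in ascending order, plus the number of documents skipped. */
    static final class Result {
        final List<Long> topValues;
        final int skipped;

        Result(List<Long> topValues, int skipped) {
            this.topValues = topValues;
            this.skipped = skipped;
        }
    }

    static Result collectTopK(List<Slice> slices, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Visit slices by increasing minimum foo so the threshold tightens early.
        List<Slice> ordered = new ArrayList<>(slices);
        ordered.sort(Comparator.comparingLong((Slice s) -> s.minFoo));

        // Max-heap holding the k smallest values so far; its head is the k-th best.
        PriorityQueue<Long> heap = new PriorityQueue<>(Collections.reverseOrder());
        int skipped = 0;
        for (Slice slice : ordered) {          // sequential, not parallel
            for (long value : slice.fooValues) {
                if (heap.size() == k && value >= heap.peek()) {
                    skipped++;                 // cannot enter the top k
                    continue;
                }
                heap.add(value);
                if (heap.size() > k) {
                    heap.poll();               // evict the previous k-th best
                }
            }
        }
        List<Long> top = new ArrayList<>(heap);
        Collections.sort(top);
        return new Result(top, skipped);
    }
}
```

The skipped counter only counts rejected values here. In a real collector the benefit would
come from not evaluating those documents at all, which this sketch does not model.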

This could be especially useful in the worst-case scenario where index order is inversely
correlated with sort order. For instance, lots of users end up pushing logs to Lucene indices,
and more recent logs usually get higher doc IDs, so fetching the most recent logs hits exactly
this worst case. Index sorting could help address this problem, but these users often have
lots of data and care about indexing rate, while index sorting adds overhead to indexing.

A related idea that [~jimczi] mentioned to me would be to shuffle segments both at merge
time and when opening point-in-time views, in order to avoid ever having an index order that
is inversely correlated with sort order, similarly to how one can avoid running into
quicksort's worst case by shuffling the array first.
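
The effect of visiting order can be illustrated with a small plain-Java simulation. Again
this is only a sketch under assumptions: SegmentShuffleSketch and its methods are
hypothetical names, segments are modelled as arrays of log timestamps where later segments
hold newer logs, and the query asks for the k most recent entries. It counts how many
documents enter the top-k queue for a given segment order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Illustrative simulation only; none of these names are Lucene APIs. */
final class SegmentShuffleSketch {

    /** Builds segments of log timestamps where later segments hold newer logs. */
    static List<long[]> buildLogSegments(int segmentCount, int docsPerSegment) {
        List<long[]> segments = new ArrayList<>();
        long timestamp = 0;
        for (int s = 0; s < segmentCount; s++) {
            long[] docs = new long[docsPerSegment];
            for (int d = 0; d < docsPerSegment; d++) {
                docs[d] = timestamp++;
            }
            segments.add(docs);
        }
        return segments;
    }

    /** Counts documents that enter the top-k queue when asking for the k newest. */
    static int countCompetitive(List<long[]> segments, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Min-heap of the k largest timestamps; its head is the k-th newest so far.
        PriorityQueue<Long> heap = new PriorityQueue<>();
        int competitive = 0;
        for (long[] segment : segments) {
            for (long ts : segment) {
                if (heap.size() == k && ts <= heap.peek()) {
                    continue;                  // not newer than the k-th newest
                }
                heap.add(ts);
                competitive++;
                if (heap.size() > k) {
                    heap.poll();
                }
            }
        }
        return competitive;
    }

    /** Returns a copy of the segment list in a seeded random order. */
    static List<long[]> shuffled(List<long[]> segments, long seed) {
        List<long[]> copy = new ArrayList<>(segments);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }
}
```

With segments visited in index order, every document is newer than everything seen before,
so countCompetitive returns the total number of documents. Passing the result of shuffled
instead can never return more than that, and usually returns fewer, since the newest segment
tends to be reached before the end.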

 

> Order LeafReaderContexts by Estimated Number Of Hits
> ----------------------------------------------------
>
>                 Key: LUCENE-8788
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8788
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Atri Sharma
>            Priority: Major
>
> We offer no guarantee on the order in which an IndexSearcher will look at segments during
> a search operation. This can be improved for use cases where an engine using Lucene invokes
> early termination and uses the partially collected hits. A better model would be if we sorted
> segments by the estimated number of hits, thus increasing the probability of the overall
> relevance of the returned partial results.



