lucene-dev mailing list archives

From "AMIRAULT Martin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (LUCENE-7482) Faster sorted index search for reverse order search
Date Fri, 07 Oct 2016 02:42:20 GMT

     [ https://issues.apache.org/jira/browse/LUCENE-7482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

AMIRAULT Martin updated LUCENE-7482:
------------------------------------
    Description: 
We currently use Lucene at my company for our main product.
Our search functionality is quite basic: results are always sorted on a predefined
field, and the user can only choose the sort order (Asc/Desc).

I am currently investigating the index sort feature together with EarlyTerminatingSortingCollector.

It is a shame that searching a sorted index in reverse order gets no optimization,
and I was wondering whether it could be made faster by creating a special "ReverseSortingCollector"
for this purpose.

I am aware that a postings list is designed to always be iterated in the same order, so this is
not about early-terminating the search but about filtering out unneeded documents more
efficiently.

If a segment is sorted in the reverse of the search order, the best hits for the search sort
are the last documents of the segment, so we can easily work out the docId from which documents
should be collected.

Here is a quick code sample:

{code:title=ReverseSortingCollector.java|borderStyle=solid}
import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Sort;

public class ReverseSortingCollector extends FilterCollector {

  /** Sort used to sort the search results */
  protected final Sort sort;
  /** Number of documents to collect in each segment */
  protected final int numDocsToCollect;

[...]

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext context) throws IOException {
    LeafReader reader = context.reader();
    Sort segmentSort = reader.getIndexSort();
    // The segment is sorted in the reverse of the search sort
    if (isReverseOrder(sort, segmentSort)) {
      // Here we can easily work out the docId from which we should collect.
      // Assumes docIds span 0 to numDocs() (see the question below).
      final long collectFrom = reader.numDocs() - numDocsToCollect;

      return new FilterLeafCollector(in.getLeafCollector(context)) {
        @Override
        public void collect(int doc) throws IOException {
          // Only delegate documents at or after the cutoff
          if (doc >= collectFrom) {
            super.collect(doc);
          }
        }
      };
    } else {
      return in.getLeafCollector(context);
    }
  }

}
{code}

This is especially efficient when used along with TopFieldCollector, as many doc-value lookups
are skipped.
In my experiment it reduced search time by 90%.
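
The effect of the cutoff can be simulated without Lucene. The sketch below is only an illustration: the class and method names are hypothetical, a plain int array stands in for the docIds a query matches, and it assumes the query matches every document in the segment. With a sparser query, fewer than numDocsToCollect hits may fall after the cutoff.

```java
import java.util.ArrayList;
import java.util.List;

class ReverseCutoffDemo {

  /**
   * Returns the docIds that a cutoff filter would pass on to the wrapped collector.
   * segmentSize and numDocsToCollect are hypothetical stand-ins for the segment's
   * document count and the requested number of hits.
   */
  static List<Integer> collected(int[] matchingDocs, int segmentSize, int numDocsToCollect) {
    int collectFrom = Math.max(0, segmentSize - numDocsToCollect);
    List<Integer> passed = new ArrayList<>();
    // Matching docs are always visited in increasing docId order.
    for (int doc : matchingDocs) {
      if (doc >= collectFrom) {
        passed.add(doc);
      }
    }
    return passed;
  }

  public static void main(String[] args) {
    int[] allDocs = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    // Only the last three docIds reach the wrapped collector.
    System.out.println(collected(allDocs, 10, 3)); // prints [7, 8, 9]
  }
}
```

The seven skipped documents never reach the wrapped collector, which is where the saved doc-value lookups would come from.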

However, I am not sure it is correct, as my knowledge of Lucene is still quite limited.
In particular, is it correct to assume that docIds in a LeafReader always span 0 to LeafReader.numDocs()?
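
To make the question concrete, here is a small sketch of the cutoff arithmetic when a segment contains deleted documents. It relies on the documented IndexReader behaviour that maxDoc() is one greater than the largest possible docId while numDocs() counts only live documents. Everything else is an assumption for illustration: a java.util.BitSet stands in for the segment's live docs, the class and method names are hypothetical, and the query is assumed to match every live document.

```java
import java.util.BitSet;

class CutoffWithDeletions {

  /** Counts live documents whose docId is at or after collectFrom. */
  static int liveDocsFrom(BitSet liveDocs, int maxDoc, int collectFrom) {
    int count = 0;
    for (int doc = Math.max(0, collectFrom); doc < maxDoc; doc++) {
      if (liveDocs.get(doc)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int maxDoc = 10;                       // docIds 0..9
    BitSet liveDocs = new BitSet(maxDoc);
    liveDocs.set(0, maxDoc);
    liveDocs.clear(7, maxDoc);             // docs 7, 8 and 9 are deleted
    int numDocs = liveDocs.cardinality();  // 7 live documents
    int numDocsToCollect = 3;

    // Cutoff derived from numDocs: starts at doc 4, so docs 4, 5 and 6 are still collected.
    System.out.println(liveDocsFrom(liveDocs, maxDoc, numDocs - numDocsToCollect)); // prints 3
    // Cutoff derived from maxDoc: starts at doc 7, where every document is deleted.
    System.out.println(liveDocsFrom(liveDocs, maxDoc, maxDoc - numDocsToCollect));  // prints 0
  }
}
```

In this toy setup the numDocs-based cutoff is the conservative one: it leaves room for the deleted docIds, at the cost of delegating a few more documents than strictly needed.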


Note: this does not support paging. It could eventually be implemented by providing a way to look
up the docId to start from, based on the last document collected (e.g. for a LongPoint field, querying
for the docId closest to the previously returned value...)
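
A rough sketch of how such a paging lookup might work, under several assumptions that are not part of the collector above: a long[] holding each document's sort value in docId order stands in for the real field lookup, the segment is sorted in descending order while the search asks for ascending order, values are unique (ties are ignored), and the query matches every document. The class and method names are hypothetical.

```java
import java.util.Arrays;

class ReversePaging {

  /**
   * Binary search over values stored in descending order by docId.
   * Returns the first docId whose value is <= after, or the array length if none.
   */
  static int firstDocAtOrBelow(long[] valuesByDocId, long after) {
    int lo = 0;
    int hi = valuesByDocId.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (valuesByDocId[mid] <= after) {
        hi = mid;
      } else {
        lo = mid + 1;
      }
    }
    return lo;
  }

  /** Returns the docId window {from, to} (to is exclusive) holding the next page. */
  static int[] nextPageWindow(long[] valuesByDocId, long after, int pageSize) {
    int to = firstDocAtOrBelow(valuesByDocId, after);
    int from = Math.max(0, to - pageSize);
    return new int[] {from, to};
  }

  public static void main(String[] args) {
    long[] values = {90, 80, 70, 60, 50, 40, 30, 20, 10};
    // The first ascending page of size 3 is 10, 20, 30 (docs 6 to 8); its last value is 30.
    int[] window = nextPageWindow(values, 30L, 3);
    // The next page is 40, 50, 60, stored in docs 3 to 5.
    System.out.println(Arrays.toString(window)); // prints [3, 6]
  }
}
```

A paging-aware collector would then delegate only documents with from <= doc < to instead of using a single lower cutoff.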




> Faster sorted index search for reverse order search
> ---------------------------------------------------
>
>                 Key: LUCENE-7482
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7482
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: AMIRAULT Martin
>            Priority: Minor



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

