lucy-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Nick Wellnhofer <wellnho...@aevum.de>
Subject Re: [lucy-user] Lucy Knowledge Questions
Date Thu, 23 Feb 2017 17:12:26 GMT
On 23/02/2017 01:07, kasilak wrote:
> (1)  You have mentioned above in your response, that Lexicon/Postings are
> stored as sorted on-disk array with binary search. Is the sorted array is
> represented as a binary tree (underlying data structure)? Is the binary
> search is binary tree search?

No, it's the classic binary search algorithm that works on a sorted array:

     https://en.wikipedia.org/wiki/Binary_search_algorithm

> Can I point me to "C" source code where this
> is done?

It's implemented in LexIndex_Seek:

 
https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/LexIndex.c;h=0d7a0ea4f4d4f2d720df13be2919691d0977c8ab;hb=HEAD#l141

> (2) While performing searches on a indexed cf.dat, is there a fixed cost to
> decompress the "delta encoded numbers and strings"?

What do you mean by "fixed cost"?

> (3) I am taking measurements using my toy benchmark. I have indexed 10,000
> documents. In the IxSearcher_Hits, I am changing the num_wanted for the
> given query string. My expectation was changing the num_wanted shouldn't
> affect the performance (measured using QPS "time taken to search and report
> the query in queries per second (QPS)").
> On the contrary what I am observing is when the num_wanted is set to 10
> hits, then the QPS is higher in comparison with num_wanted is set to 4000
> hits.
> If the TF/IDF (term frequency/inverse document frequency) ranking is
> constructed always and will report the top 10 hits or top 4000 hits,  then
> my expectation is QPS shouldn't change.
> "The only explanation I have is the difference in QPS is because in the case
> of 10 hits the search stops as soon as the IxSearcher_Hits found 10 hits
> (disregarding the ranking algorithm) and in the case of 4000 hits, it will
> continue and stop as soon as 4000 hits or maximum the application can find.
> This means IxSearcher_Hits doesn't use any inherent ranking to return the
> relevant documents. Is my theory right?"

No, IxSearcher_Hits always returns sorted documents. The search process can be 
divided into three steps:

1. Find all N documents matching a query.
2. Find and sort the top M documents according to the SortSpec. Lucy
    uses a priority queue implemented as a heap (HitQueue), so time
    complexity is O(N * log(M)).
3. Retrieve the stored fields for each of the top M documents.

Steps 2 and 3 are obviously faster for smaller values of M.

Nick


Mime
View raw message