On 23/02/2017 01:07, kasilak wrote:
> (1) You have mentioned above in your response, that Lexicon/Postings are
> stored as sorted ondisk array with binary search. Is the sorted array is
> represented as a binary tree (underlying data structure)? Is the binary
> search is binary tree search?
No, it's the classic binary search algorithm that works on a sorted array:
https://en.wikipedia.org/wiki/Binary_search_algorithm
> Can I point me to "C" source code where this
> is done?
It's implemented in LexIndex_Seek:
https://git1uswest.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/LexIndex.c;h=0d7a0ea4f4d4f2d720df13be2919691d0977c8ab;hb=HEAD#l141
> (2) While performing searches on a indexed cf.dat, is there a fixed cost to
> decompress the "delta encoded numbers and strings"?
What do you mean by "fixed cost"?
> (3) I am taking measurements using my toy benchmark. I have indexed 10,000
> documents. In the IxSearcher_Hits, I am changing the num_wanted for the
> given query string. My expectation was changing the num_wanted shouldn't
> affect the performance (measured using QPS "time taken to search and report
> the query in queries per second (QPS)").
> On the contrary what I am observing is when the num_wanted is set to 10
> hits, then the QPS is higher in comparison with num_wanted is set to 4000
> hits.
> If the TF/IDF (term frequency/inverse document frequency) ranking is
> constructed always and will report the top 10 hits or top 4000 hits, then
> my expectation is QPS shouldn't change.
> "The only explanation I have is the difference in QPS is because in the case
> of 10 hits the search stops as soon as the IxSearcher_Hits found 10 hits
> (disregarding the ranking algorithm) and in the case of 4000 hits, it will
> continue and stop as soon as 4000 hits or maximum the application can find.
> This means IxSearcher_Hits doesn't use any inherent ranking to return the
> relevant documents. Is my theory right?"
No, IxSearcher_Hits always returns sorted documents. The search process can be
divided into three steps:
1. Find all N documents matching a query.
2. Find and sort the top M documents according to the SortSpec. Lucy
uses a priority queue implemented as a heap (HitQueue), so time
complexity is O(N * log(M)).
3. Retrieve the stored fields for each of the top M documents.
Steps 2 and 3 are obviously faster for smaller values of M.
Nick
