lucy-user mailing list archives

From: Nick Wellnhofer <wellnho...@aevum.de>
Subject: Re: [lucy-user] Lucy Benchmarking
Date: Tue, 14 Feb 2017 13:47:14 GMT
On 14/02/2017 00:57, Kasi Lakshman Karthi Anbumony wrote:
> (1) What is the data structure used to represent the Lexicon? (Clownfish
> supports a hash table. Does that mean Lucy uses a hash table?)

Lexicon is essentially a sorted on-disk array that is searched with binary 
search. Clownfish::Hash, on the other hand, is an in-memory data structure. 
Lucy doesn't build in-memory structures for most index data because this would 
incur a huge startup penalty. This also makes it possible to work with indices 
that don't fit in RAM, although performance deteriorates quickly in this case.
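
To make that concrete, here is a minimal C sketch of looking up a term by 
binary search over a sorted array, which is the basic idea behind the 
Lexicon lookup. The data and function names are made up for illustration; 
this is not Lucy's actual on-disk format or API.

    #include <stdio.h>
    #include <string.h>

    /* Binary search over a sorted array of terms. Lucy's Lexicon works on
     * sorted on-disk data rather than an in-memory array, but the lookup
     * principle is the same. */
    static int
    find_term(const char **terms, int num_terms, const char *target) {
        int lo = 0, hi = num_terms - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int cmp = strcmp(terms[mid], target);
            if (cmp == 0)     { return mid; }      /* found */
            else if (cmp < 0) { lo = mid + 1; }
            else              { hi = mid - 1; }
        }
        return -1;                                 /* not present */
    }

    int main(void) {
        const char *terms[] = { "apple", "banana", "cherry", "date" };
        printf("%d\n", find_term(terms, 4, "cherry"));   /* prints 2 */
        return 0;
    }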

> (2) What is the data structure used to represent postings? (Clownfish
> supports a hash table. Does that mean Lucy uses a hash table?)

Posting lists are stored in an on-disk array; the location of each term's 
posting list within that array is recorded in the Lexicon.
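
As a rough picture of how the two structures relate, each lexicon entry can 
be thought of as carrying the term, its document frequency and the location 
of its posting list. The struct below is purely hypothetical; the field 
names are invented for illustration and are not Lucy's real structures or 
file format.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical lexicon entry pointing at a posting list on disk. */
    typedef struct {
        const char *term;        /* the indexed term                        */
        int32_t     doc_freq;    /* number of documents containing the term */
        int64_t     post_offset; /* byte offset of its posting list on disk */
    } LexEntry;

    int main(void) {
        /* Conceptually: binary-search the sorted lexicon for the term, then
         * read the posting list starting at post_offset. */
        LexEntry entry = { "lucy", 3, 1024 };
        printf("term=%s doc_freq=%d offset=%lld\n",
               entry.term, entry.doc_freq, (long long)entry.post_offset);
        return 0;
    }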

> (3) Which compression method is used? Is it enabled by default?

Lexicon and posting list data is always compressed with delta encoding for 
numbers and incremental encoding for strings.
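
Here is a small, self-contained C example of both ideas, using made-up 
sample data. It only demonstrates the encoding principle, not Lucy's exact 
wire format.

    #include <stddef.h>
    #include <stdio.h>

    int main(void) {
        /* Delta encoding: store the gap to the previous doc ID instead of
         * the absolute value, so the stored numbers stay small. */
        int doc_ids[] = { 5, 9, 12, 40 };
        int prev = 0;
        for (int i = 0; i < 4; i++) {
            printf("delta: %d\n", doc_ids[i] - prev);   /* 5, 4, 3, 28 */
            prev = doc_ids[i];
        }

        /* Incremental (prefix) encoding: store the length of the prefix
         * shared with the previous string plus the differing suffix. */
        const char *terms[] = { "search", "searched", "searching" };
        const char *prev_term = "";
        for (int i = 0; i < 3; i++) {
            size_t shared = 0;
            while (prev_term[shared] && terms[i][shared]
                   && prev_term[shared] == terms[i][shared]) {
                shared++;
            }
            printf("prefix %zu + \"%s\"\n", shared, terms[i] + shared);
            prev_term = terms[i];
        }
        return 0;
    }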

> (4) Why is there no API (function call) to get the number of terms in the
> lexicon and posting list for a given cf.dat?

It's generally hard to say why a certain feature wasn't implemented. The only 
answer I can give is that no one has deemed it important enough so far. But 
Lucy is open-source software, so anyone is free to implement the features 
they need.

> (3) Can I know whether searching through the lexicon/posting list is an
> in-memory process or an IO process?

Lucy uses memory-mapped files to access most index data, so the distinction 
between in-memory and IO-based operation blurs quite a bit.
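
For reference, this is roughly what memory-mapped access looks like at the 
POSIX level: the OS faults pages in on demand and keeps them in the page 
cache, so repeated reads behave like memory access while cold reads still 
hit the disk. The file name is only a placeholder; Lucy manages its index 
files internally.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "cf.dat";            /* placeholder index file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { return 1; }

        /* Map the whole file read-only; reads then look like ordinary
         * pointer dereferences backed by the OS page cache. */
        const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                                MAP_SHARED, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: 0x%02x\n", (unsigned char)data[0]);

        munmap((void *)data, (size_t)st.st_size);
        close(fd);
        return 0;
    }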

Nick

