lucy-user mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: [lucy-user] Lucy Benchmarking
Date Sat, 11 Feb 2017 18:30:06 GMT
On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony
<kasi.anbumony@gmail.com> wrote:

> As a follow on question, based on this link:
> https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
>
> (1) Why the cf.dat has a document section?

The search needs to give something back to you to identify which
documents were hits. Lucy's internal document IDs change over time, so
they are not suitable for that purpose.  You need to at least store your
own identifier, even if you choose not to store other parts of the
document.

> (2) Why is it not compressed?

Compression is not done by default, but there are extension points
allowing that behavior to be overridden. There's even example code that
ships with Lucy and does exactly what you suggest.  It's in Perl, but
could be ported to C.

 $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
 $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm
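
To illustrate the general idea only, here is a minimal Python sketch of
zlib-compressing stored field values on write and inflating them on
read. It is not a port of the Perl modules above and does not use any
Lucy API; the names compress_doc and decompress_doc and the sample
document are hypothetical.

```python
import zlib

def compress_doc(fields):
    """Compress each stored field value independently with zlib.

    Hypothetical helper for illustration; not part of Lucy.
    """
    return {name: zlib.compress(text.encode("utf-8"))
            for name, text in fields.items()}

def decompress_doc(stored):
    """Reverse compress_doc: inflate each field back to text."""
    return {name: zlib.decompress(blob).decode("utf-8")
            for name, blob in stored.items()}

# Made-up sample document with a long, repetitive body.
doc = {"title": "Moby-Dick", "content": "Call me Ishmael. " * 200}

stored = compress_doc(doc)
restored = decompress_doc(stored)

raw_size = len(doc["content"].encode("utf-8"))
packed_size = len(stored["content"])
print(raw_size, packed_size)
```

Long text fields shrink considerably, while very short values such as
the title can come out slightly larger than the original because of
zlib's fixed overhead.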

> I see most of the content of the books I have indexed being part of cf.dat
> file and can read the text as it is! Is this how the inverted indexing
> works?

The document storage part of a Lucy datastore is separate from the
inverted index.  The inverted index data structures are definitely
compressed, using algorithms tuned to the task of search. The first
part of the search yields a set of internal Lucy document IDs, which
are then used to look up whatever's in document storage.
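
The two phases can be sketched with a toy example in Python. This is
an illustration under simplifying assumptions (an uncompressed
in-memory index, single-term queries, made-up data), not how Lucy is
implemented; inverted_index, doc_storage and search are hypothetical
names.

```python
from collections import defaultdict

# Hypothetical document storage: internal doc ID -> stored fields.
# Only a caller-supplied identifier and a title are stored here.
doc_storage = {
    1: {"id": "book-17", "title": "Moby-Dick"},
    2: {"id": "book-42", "title": "Walden"},
    3: {"id": "book-99", "title": "Typee"},
}

# Made-up text that gets indexed but is not stored.
texts = {
    1: "the whale and the sea",
    2: "the pond and the woods",
    3: "the island and the sea",
}

# Build a toy inverted index: term -> sorted list of internal doc IDs.
inverted_index = defaultdict(list)
for doc_id in sorted(texts):
    for term in set(texts[doc_id].split()):
        inverted_index[term].append(doc_id)

def search(term):
    # Phase 1: the inverted index yields internal doc IDs.
    hits = inverted_index.get(term, [])
    # Phase 2: look up each hit in document storage and return the
    # caller's own identifier, not the internal ID.
    return [doc_storage[doc_id]["id"] for doc_id in hits]

result = search("sea")
print(result)
```

The second phase only touches the documents that were hits, however
many documents were indexed.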

From a performance perspective, the cost to perform the inverted index
search is roughly proportional to the size of the corpus, whereas the
cost to retrieve the document content afterwards is proportional to
the number of documents retrieved.  When scaling to larger
collections, compressing the inverted index is more important than
compressing document storage, since the number of documents searched
grows while the number of documents retrieved often stays the same.

Of course, it may still be reasonable to compress document storage,
depending on your usage pattern. But if, for example, you're only
storing short identifiers, there's no need.

Marvin Humphrey
