lucy-user mailing list archives

From Kasi Lakshman Karthi Anbumony <kasi.anbum...@gmail.com>
Subject Re: [lucy-user] Lucy Benchmarking
Date Mon, 13 Feb 2017 23:57:59 GMT
Hi Murphy:

Thanks for your detailed explanation.

Given the significance of inverted index compression, could you clarify
the following so that I can better understand the inner workings:

(1) What is the data structure used to represent Lexicon? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)

(2) What is the data structure used to represent postings? (Clownfish
supports hashtable. Does it mean Lucy uses hashtable?)

(3) Which compression method is used? Is it enabled by default?

(4) Why is there no API (function call) to get the number of terms in
the lexicon and posting list for a given cf.dat?

(5) Is searching through the lexicon/posting list an in-memory process
or an I/O process?

Thanks
-Kasi


On Sat, Feb 11, 2017 at 1:30 PM, Marvin Humphrey <marvin@rectangular.com>
wrote:

> On Thu, Feb 9, 2017 at 3:51 PM, Kasi Lakshman Karthi Anbumony
> <kasi.anbumony@gmail.com> wrote:
>
> > As a follow on question, based on this link:
> > https://lucy.apache.org/docs/c/Lucy/Docs/FileFormat.html
> >
> > (1) Why does cf.dat have a document section?
>
> The search needs to give something back to you to identify which
> documents were hits. Lucy's internal document IDs change over time, so
> are not suitable for that purpose.  You need to at least store your
> own identifier, even if you choose not to store other parts of the
> document.
>
> > (2) Why is it not compressed?
>
> It's not done by default, but there are extension points allowing that
> behavior to be overridden. There's even example code which ships with
> Lucy which does exactly what you suggest.  It's in Perl, but could be
> ported to C.
>
>  $REPO/perl/lib/LucyX/Index/ZlibDocReader.pm
>  $REPO/perl/lib/LucyX/Index/ZlibDocWriter.pm
>
> > I see most of the content of the books I have indexed being part of
> > cf.dat file and can read the text as it is! Is this how the inverted
> > indexing works?
>
> The document storage part of a Lucy datastore is separate from the
> inverted index.  The inverted index data structures are definitely
> compressed, using algorithms tuned to the task of search. The first
> part of the search yields a set of internal Lucy document IDs, which
> are then used to look up whatever's in document storage.
>
> From a performance perspective, the cost to perform the inverted index
> search is roughly proportional to the size of the corpus, whereas the
> cost to retrieve the document content afterwards is proportional to
> the number of documents retrieved.  When scaling to larger
> collections, compressing the inverted index is more important than
> compressing document storage, since the number of documents searched
> grows while the number of documents retrieved often stays the same.
>
> Of course it may still be reasonable to compress document storage,
> depending on usage pattern. But if for example you're only storing
> short identifiers, there's no need.
>
> Marvin Humphrey
>
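
To make the two-stage lookup described above concrete, here is a toy
sketch. It is not Lucy's code or on-disk format. The gap plus
variable-byte encoding is an assumed stand-in for whatever compression
Lucy actually applies to postings, and every name in it
(encode_postings, decode_postings, build_index, search, my_id) is
hypothetical.

```python
# Toy model only: NOT Lucy's implementation or file format.
# Assumption: postings are stored as gaps between sorted doc IDs, each gap
# written in variable-byte form. This is a common inverted-index compression
# scheme, used here purely as an illustration.

def encode_postings(doc_ids):
    """Encode an ascending list of internal doc IDs as variable-byte gaps."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)  # last byte of this gap, continuation bit clear
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings: recover the ascending doc ID list."""
    doc_ids = []
    prev = gap = shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes follow for this gap
        else:
            prev += gap
            doc_ids.append(prev)
            gap = shift = 0
    return doc_ids


def build_index(texts):
    """Map each term to a compressed posting list.

    texts: dict of internal doc ID -> document text.
    """
    postings = {}
    for doc_id in sorted(texts):
        for term in set(texts[doc_id].lower().split()):
            postings.setdefault(term, []).append(doc_id)
    return {term: encode_postings(ids) for term, ids in postings.items()}


def search(index, doc_store, term):
    """Return the caller's own identifiers for documents containing term."""
    # Stage 1: the inverted index yields internal doc IDs.
    hits = decode_postings(index.get(term.lower(), b""))
    # Stage 2: only the hits are looked up in document storage.
    return [doc_store[doc_id]["my_id"] for doc_id in hits]
```

For example, with texts = {1: "the quick brown fox", 2: "lazy dog",
300: "quick dog"} and a doc_store that maps 1, 2 and 300 to my_id values
"book-a", "book-b" and "book-c", search(build_index(texts), doc_store,
"quick") returns ["book-a", "book-c"]. The document store holds only the
short identifiers here, which is the case where compressing it gains
little.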
