Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: java-user@lucene.apache.org
Received-SPF: pass (nike.apache.org: local policy includes SPF record at
 spf.trusted-forwarder.org)
MIME-Version: 1.0
In-Reply-To: 
 <CABGq8w-R1M9o2A6e4LsQFcSQjHUZB9CwX0ftffsVLauDhUk11Q@mail.gmail.com>
References: 
 <CABGq8w-R1M9o2A6e4LsQFcSQjHUZB9CwX0ftffsVLauDhUk11Q@mail.gmail.com>
From: Michael McCandless <lucene@mikemccandless.com>
Date: Mon, 23 Sep 2013 07:55:20 -0400
Message-ID: 
 <CAL8PwkYxFDjZf=+dEyHLERkhRp_Uq5rJ8rHJBoE2guHsFJ0fiQ@mail.gmail.com>
Subject: Re: Understanding FST Prefix & CheckIndex output
To: Lucene Users <java-user@lucene.apache.org>
Cc: manuel.lenormand@gmail.com
Content-Type: text/plain; charset=ISO-8859-1

On Sun, Sep 22, 2013 at 9:34 AM, Manuel Le Normand
<manuel.lenormand@gmail.com> wrote:
> Hi there,
> I try to deep dive into the inner LucenePostingFormat to check what might I
> do for improving query performance. I'm curious about the termBlock stats
> that I get from checkIndex -verbose.
>
> What does the followong mean:
> index FST bytes - the FST size, which is the field's partition of the .tip
> file?

Yes.

> num of terms - written 2M, although Luke interface shows me 8M, how come?

This should be the total unique term count for that field; not sure
why Luke disagrees.  Maybe Luke is printing Terms.sumTotalTermFreq =
total number of term occurrences (not unique terms).

> term / index FST bytes - summing up all my fields bytes doesn't get me
> close to the .tim / tip file, how come?

The .tim file also contains the postings metadata (file pointers into
.doc, .pos, skip length, etc.), so .tim file size will be larger than
the terms + index bytes as reported by CheckIndex.

> blocks - these are the SUFFIX blocks (.tim files), which are implemented as
> Burst Tries, right?

Yes; this is basically the number of nodes in the prefix trie, where
each node has a block of terms' suffixes, plus their metadata
(docFreq, totalTermFreq) plus the postings metadata.

> block types - where can I get the info about these different types?

A terms-only block is a "leaf" node in the prefix trie: it has no
child sub-blocks, so all of its entries are terms.

A sub-clock only block is a non-leaf node that has only pointers to
other blocks, i.e. it contains no terms itself.

A mixed block is all other blocks: they contain at least one term and
at least one sub-block.  The total number of blocks is equal to the
sum of these three counts.

A floor block is orthogonal to the above breakdown, and is used when a
single block would contain too many entries.  This is necessary
because the problem of dividing up the terms into blocks of size 25 -
48 terms or sub-blocks is over-constrained.  Since the labels are
bytes, in the worst case you could have 255 entries in a block; when
this happens, we split the block into an initial floor block, followed
by a number of "sub-floor" blocks.  When seeking, the FST points to
each of these floor / sub-floor blocks so e.g. if you are trying to
get to byte 255 at this point, you don't have to scan through the
first 255 entries to get there.

> As background, my main performance issue is (random?) read miss IO while
> looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
> queries, so my optimization is avoiding IO's.

Maybe try the new in-memory terms dict?  It holds all terms + metadata
in RAM so there's no seeking when looking up the term.  But the
postings are still on disk, so seeking is necessary there.  This is
currently trunk only, FSTTerms* under lucene/codec.

You could also try MemoryPostingsFormat, which holds all terms +
postings in RAM.

But you need to have enough RAM vs your index size for the fields, for these :)

> That said, is there any
> reason getting the right block will require more than <segment_count> IO
> (of 4kB)?

In the worst case it's 2 seeks per term: first the FST tells us which
block to load from the .tim file, so we seek + scan to it.  Then,
assuming you need to visit the documents for this term, it's another
seek to get to the right block for the postings, unless there was only
1 doc for this term and you indexed DOCS_ONLY, in which case that
docID is inlined into the term metadata (so only one seek).

If you also need positions, then that will be another seek; if
payloads/offsets, another seek.

> Does a certain distribution of prefix length of block types should alarm me
> in some way?
>
> field "text_txt"
>   index FST:
>     18300 nodes
>     45779 arc
>     583438 bytes
>   term:
>     2053393 terms
>     25597203 bytes (12.5 bytes/term)
>   blocks:
>     66086 blocks
>     51870 terms-only blocks
>     47 sub-block-only blocks
>     14169 mixed blocks
>     13599 floor blocks
>     22862 non-floor blocks
>     43224 floor sub-blcoks
>     18289568 term suffix bytes (276.8 suffix-bytes/block)
>     4174480 term stas bytes (63.2 stats-bytes/block)
>     7632796 other bytes (115.5 stats-bytes/block)
>     by prefix length:
>       0: 1
>       1: 683
>       2: 10782
>       3. 17133
>
> etc...

There are term distributions that are "better", but there should be no
"adversary" case that cripples the performance.

E.g. if you use UIDs, the best performance should be 0-prefixed base-N
ints, where N is over 25 and well below 48, is best, and they are
sequentially assigned.  Random UIDs are worst!

But, no distribution by prefix length should be "alarming" ...

Mike McCandless

http://blog.mikemccandless.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org