lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Michael McCandless <>
Subject Re: Understanding FST Prefix & CheckIndex output
Date Mon, 23 Sep 2013 11:55:20 GMT
On Sun, Sep 22, 2013 at 9:34 AM, Manuel Le Normand
<> wrote:
> Hi there,
> I try to deep dive into the inner LucenePostingFormat to check what might I
> do for improving query performance. I'm curious about the termBlock stats
> that I get from checkIndex -verbose.
> What does the followong mean:
> index FST bytes - the FST size, which is the field's partition of the .tip
> file?


> num of terms - written 2M, although Luke interface shows me 8M, how come?

This should be the total unique term count for that field; not sure
why Luke disagrees.  Maybe Luke is printing Terms.sumTotalTermFreq =
total number of term occurrences (not unique terms).

> term / index FST bytes - summing up all my fields bytes doesn't get me
> close to the .tim / tip file, how come?

The .tim file also contains the postings metadata (file pointers into
.doc, .pos, skip length, etc.), so .tim file size will be larger than
the terms + index bytes as reported by CheckIndex.

> blocks - these are the SUFFIX blocks (.tim files), which are implemented as
> Burst Tries, right?

Yes; this is basically the number of nodes in the prefix trie, where
each node has a block of terms' suffixes, plus their metadata
(docFreq, totalTermFreq) plus the postings metadata.

> block types - where can I get the info about these different types?

A terms-only block is a "leaf" node in the prefix trie: it has no
child sub-blocks, so all of its entries are terms.

A sub-clock only block is a non-leaf node that has only pointers to
other blocks, i.e. it contains no terms itself.

A mixed block is all other blocks: they contain at least one term and
at least one sub-block.  The total number of blocks is equal to the
sum of these three counts.

A floor block is orthogonal to the above breakdown, and is used when a
single block would contain too many entries.  This is necessary
because the problem of dividing up the terms into blocks of size 25 -
48 terms or sub-blocks is over-constrained.  Since the labels are
bytes, in the worst case you could have 255 entries in a block; when
this happens, we split the block into an initial floor block, followed
by a number of "sub-floor" blocks.  When seeking, the FST points to
each of these floor / sub-floor blocks so e.g. if you are trying to
get to byte 255 at this point, you don't have to scan through the
first 255 entries to get there.

> As background, my main performance issue is (random?) read miss IO while
> looking up terms in the BlockTreeTerm (tim files, right?) on heavy-termed
> queries, so my optimization is avoiding IO's.

Maybe try the new in-memory terms dict?  It holds all terms + metadata
in RAM so there's no seeking when looking up the term.  But the
postings are still on disk, so seeking is necessary there.  This is
currently trunk only, FSTTerms* under lucene/codec.

You could also try MemoryPostingsFormat, which holds all terms +
postings in RAM.

But you need to have enough RAM vs your index size for the fields, for these :)

> That said, is there any
> reason getting the right block will require more than <segment_count> IO
> (of 4kB)?

In the worst case it's 2 seeks per term: first the FST tells us which
block to load from the .tim file, so we seek + scan to it.  Then,
assuming you need to visit the documents for this term, it's another
seek to get to the right block for the postings, unless there was only
1 doc for this term and you indexed DOCS_ONLY, in which case that
docID is inlined into the term metadata (so only one seek).

If you also need positions, then that will be another seek; if
payloads/offsets, another seek.

> Does a certain distribution of prefix length of block types should alarm me
> in some way?
> field "text_txt"
>   index FST:
>     18300 nodes
>     45779 arc
>     583438 bytes
>   term:
>     2053393 terms
>     25597203 bytes (12.5 bytes/term)
>   blocks:
>     66086 blocks
>     51870 terms-only blocks
>     47 sub-block-only blocks
>     14169 mixed blocks
>     13599 floor blocks
>     22862 non-floor blocks
>     43224 floor sub-blcoks
>     18289568 term suffix bytes (276.8 suffix-bytes/block)
>     4174480 term stas bytes (63.2 stats-bytes/block)
>     7632796 other bytes (115.5 stats-bytes/block)
>     by prefix length:
>       0: 1
>       1: 683
>       2: 10782
>       3. 17133
> etc...

There are term distributions that are "better", but there should be no
"adversary" case that cripples the performance.

E.g. if you use UIDs, the best performance should be 0-prefixed base-N
ints, where N is over 25 and well below 48, is best, and they are
sequentially assigned.  Random UIDs are worst!

But, no distribution by prefix length should be "alarming" ...

Mike McCandless

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message