lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Michael McCandless (JIRA)" <>
Subject [jira] [Updated] (LUCENE-3030) Block tree terms dict & index
Date Tue, 02 Aug 2011 18:16:27 GMT


Michael McCandless updated LUCENE-3030:

    Attachment: BlockTree.png

The block tree terms dict seems to be working... all tests pass w/
StandardTree codec.  There's still more to do (many nocommits), but, I
think the perf results should be close to what I finally commit:

||Task||QPS base||StdDev base||QPS blocktree||StdDev blocktree||Pct diff||||

This is for a multi-segment index, 10 M wikipedia docs, using luceneutil.

These are huge speedups for the terms-dict intensive queries!

The two FuzzyQuerys and Respell get the speedup from the directly
implemented intersect method, and the PKLookup gets gains because it
can often avoid seeking since block tree's terms index can sometimes
rule out terms by their prefix (though, this relies on the PK terms
being "predictable" -- I use "%09d" w/ a counter, now; if you instead
used something more random looking (GUIDs )I don't think we'd see

> Block tree terms dict & index
> -----------------------------
>                 Key: LUCENE-3030
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>         Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch,
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (,
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

This message is automatically generated by JIRA.
For more information on JIRA, see:


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message