Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@lucene.apache.org
Date: Fri, 19 Aug 2011 17:00:28 +0000 (UTC)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Message-ID: 
 <1147681684.53366.1313773228517.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: 
 <1515286270.59181.1302821345674.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Commented] (LUCENE-3030) Block tree terms dict & index
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087797#comment-13087797 ] 

Michael McCandless commented on LUCENE-3030:
--------------------------------------------

There is a small hit to indexing perf here, somewhere between 0 - 5% or so depending on the run.  Given the gains for MTQs I think this is an OK tradeoff.  We can further optimize the BlockTreeTermsWriter later....

> Block tree terms dict & index
> -----------------------------
>
>                 Key: LUCENE-3030
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3030
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch
>
>
> Our default terms index today breaks terms into blocks of fixed size
> (ie, every 32 terms is a new block), and then we build an index on top
> of that (holding the start term for each block).
> But, it should be better to instead break terms according to how they
> share prefixes.  This results in variable sized blocks, but means
> within each block we maximize the shared prefix and minimize the
> resulting terms index.  It should also be a speedup for terms dict
> intensive queries because the terms index becomes a "true" prefix
> trie, and can be used to fast-fail on term lookup (ie returning
> NOT_FOUND without having to seek/scan a terms block).
> Having a true prefix trie should also enable much faster intersection
> with automaton (but this will be a new issue).
> I've made an initial impl for this (called
> BlockTreeTermsWriter/Reader).  It's still a work in progress... lots
> of nocommits, and hairy code, but tests pass (at least once!).
> I made two new codecs, temporarily called StandardTree, PulsingTree,
> that are just like their counterparts but use this new terms dict.
> I added a new "exactOnly" boolean to TermsEnum.seek.  If that's true
> and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the
> enum is unpositioned (ie you should not call next(), docs(), etc.).
> In this approach the index and dict are tightly connected, so it does
> not support a pluggable index impl like BlockTermsWriter/Reader.
> Blocks are stored on certain nodes of the prefix trie, and can contain
> both terms and pointers to sub-blocks (ie, if the block is not a leaf
> block).  So there are two trees, tied to one another -- the index
> trie, and the blocks.  Only certain nodes in the trie map to a block
> in the block tree.
> I think this algorithm is similar to burst tries
> (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),
> except it allows terms to be stored on inner blocks (not just leaf
> blocks).  This is important for Lucene because an [accidental]
> "adversary" could produce a terms dict with way too many blocks (way
> too much RAM used by the terms index).  Still, with my current patch,
> an adversary can produce too-big blocks... which we may need to fix,
> by letting the terms index not be a true prefix trie on it's leaf
> edges.
> Exactly how the blocks are picked can be factored out as its own
> policy (but I haven't done that yet).  Then, burst trie is one policy,
> my current approach is another, etc.  The policy can be tuned to
> the terms' expected distribution, eg if it's a primary key field and
> you only use base 10 for each character then you want block sizes of
> size 10.  This can make a sizable difference on lookup cost.
> I modified the FST Builder to allow for a "plugin" that freezes the
> "tail" (changed suffix) of each added term, because I use this to find
> the blocks.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org