Return-Path: X-Original-To: apmail-lucene-dev-archive@www.apache.org Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 4B3E07FA7 for ; Fri, 19 Aug 2011 17:00:55 +0000 (UTC) Received: (qmail 24431 invoked by uid 500); 19 Aug 2011 17:00:53 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 24384 invoked by uid 500); 19 Aug 2011 17:00:53 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 24376 invoked by uid 99); 19 Aug 2011 17:00:52 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 19 Aug 2011 17:00:52 +0000 X-ASF-Spam-Status: No, hits=-2001.1 required=5.0 tests=ALL_TRUSTED,RP_MATCHES_RCVD X-Spam-Check-By: apache.org Received: from [140.211.11.116] (HELO hel.zones.apache.org) (140.211.11.116) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 19 Aug 2011 17:00:49 +0000 Received: from hel.zones.apache.org (hel.zones.apache.org [140.211.11.116]) by hel.zones.apache.org (Postfix) with ESMTP id 83690C4158 for ; Fri, 19 Aug 2011 17:00:28 +0000 (UTC) Date: Fri, 19 Aug 2011 17:00:28 +0000 (UTC) From: "Michael McCandless (JIRA)" To: dev@lucene.apache.org Message-ID: <1147681684.53366.1313773228517.JavaMail.tomcat@hel.zones.apache.org> In-Reply-To: <1515286270.59181.1302821345674.JavaMail.tomcat@hel.zones.apache.org> Subject: [jira] [Commented] (LUCENE-3030) Block tree terms dict & index MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-3030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13087797#comment-13087797 ] Michael McCandless commented on LUCENE-3030: -------------------------------------------- There is a small hit to indexing perf here, somewhere between 0 - 5% or so depending on the run. Given the gains for MTQs I think this is an OK tradeoff. We can further optimize the BlockTreeTermsWriter later.... > Block tree terms dict & index > ----------------------------- > > Key: LUCENE-3030 > URL: https://issues.apache.org/jira/browse/LUCENE-3030 > Project: Lucene - Java > Issue Type: Improvement > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: BlockTree.png, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch, LUCENE-3030.patch > > > Our default terms index today breaks terms into blocks of fixed size > (ie, every 32 terms is a new block), and then we build an index on top > of that (holding the start term for each block). > But, it should be better to instead break terms according to how they > share prefixes. This results in variable sized blocks, but means > within each block we maximize the shared prefix and minimize the > resulting terms index. It should also be a speedup for terms dict > intensive queries because the terms index becomes a "true" prefix > trie, and can be used to fast-fail on term lookup (ie returning > NOT_FOUND without having to seek/scan a terms block). > Having a true prefix trie should also enable much faster intersection > with automaton (but this will be a new issue). > I've made an initial impl for this (called > BlockTreeTermsWriter/Reader). It's still a work in progress... lots > of nocommits, and hairy code, but tests pass (at least once!). > I made two new codecs, temporarily called StandardTree, PulsingTree, > that are just like their counterparts but use this new terms dict. > I added a new "exactOnly" boolean to TermsEnum.seek. If that's true > and the term is NOT_FOUND, we will (quickly) return NOT_FOUND and the > enum is unpositioned (ie you should not call next(), docs(), etc.). > In this approach the index and dict are tightly connected, so it does > not support a pluggable index impl like BlockTermsWriter/Reader. > Blocks are stored on certain nodes of the prefix trie, and can contain > both terms and pointers to sub-blocks (ie, if the block is not a leaf > block). So there are two trees, tied to one another -- the index > trie, and the blocks. Only certain nodes in the trie map to a block > in the block tree. > I think this algorithm is similar to burst tries > (http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499), > except it allows terms to be stored on inner blocks (not just leaf > blocks). This is important for Lucene because an [accidental] > "adversary" could produce a terms dict with way too many blocks (way > too much RAM used by the terms index). Still, with my current patch, > an adversary can produce too-big blocks... which we may need to fix, > by letting the terms index not be a true prefix trie on it's leaf > edges. > Exactly how the blocks are picked can be factored out as its own > policy (but I haven't done that yet). Then, burst trie is one policy, > my current approach is another, etc. The policy can be tuned to > the terms' expected distribution, eg if it's a primary key field and > you only use base 10 for each character then you want block sizes of > size 10. This can make a sizable difference on lookup cost. > I modified the FST Builder to allow for a "plugin" that freezes the > "tail" (changed suffix) of each added term, because I use this to find > the blocks. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org