Return-Path: Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: (qmail 17477 invoked from network); 3 Dec 2010 18:04:39 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 3 Dec 2010 18:04:39 -0000 Received: (qmail 83647 invoked by uid 500); 3 Dec 2010 18:04:36 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 83604 invoked by uid 500); 3 Dec 2010 18:04:36 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 83596 invoked by uid 99); 3 Dec 2010 18:04:36 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 03 Dec 2010 18:04:36 +0000 X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED X-Spam-Check-By: apache.org Received: from [140.211.11.22] (HELO thor.apache.org) (140.211.11.22) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 03 Dec 2010 18:04:33 +0000 Received: from thor (localhost [127.0.0.1]) by thor.apache.org (8.13.8+Sun/8.13.8) with ESMTP id oB3I4B1W029507 for ; Fri, 3 Dec 2010 18:04:11 GMT Message-ID: <25403746.97251291399451750.JavaMail.jira@thor> Date: Fri, 3 Dec 2010 13:04:11 -0500 (EST) From: "Robert Muir (JIRA)" To: dev@lucene.apache.org Subject: [jira] Commented: (LUCENE-2792) Add a simple FST impl to Lucene In-Reply-To: <4884951.91371291372270588.JavaMail.jira@thor> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 X-Virus-Checked: Checked by ClamAV on apache.org [ https://issues.apache.org/jira/browse/LUCENE-2792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12966604#action_12966604 ] Robert Muir commented on LUCENE-2792: ------------------------------------- bq. Can the FST be made concurrent for use in LUCENE-2312 as the terms dict implementation? Just an idea, maybe you could use something like ConcurrentSkipListMap for this until mike makes "fst 2.0" or something? its only in java 6 but you can poach from apache harmony... > Add a simple FST impl to Lucene > ------------------------------- > > Key: LUCENE-2792 > URL: https://issues.apache.org/jira/browse/LUCENE-2792 > Project: Lucene - Java > Issue Type: New Feature > Components: Index > Reporter: Michael McCandless > Assignee: Michael McCandless > Fix For: 4.0 > > Attachments: FSTExample.png, LUCENE-2792.patch > > > I implemented the algo described at > http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.24.3698 for > incrementally building a finite state transducer (FST) from sorted > inputs. > This is not a fully general FST impl -- it's only able to build up an > FST incrementally from input/output pairs that are pre-sorted. > Currently the inputs are BytesRefs, and the outputs are pluggable -- > NoOutputs gets you a simple FSA, PositiveIntOutputs maps to a long, > ByteSequenceOutput maps to a BytesRef. > The implementation has a low memory overhead, so that it can handle a > fairly large set of terms. For example, it can build the FSA for the > 9.8M terms from a 10M document wikipedia index in ~8 seconds (on > beast), using ~256 MB peak RAM, resulting in an FSA that's ~60 MB. > It packs the FST as-it-builds into a compact byte[], and then exposes > the API to read nodes/arcs directly from the byte[]. The FST can be > quickly saved/loaded to/from a Directory since it's just a big byte[]. > The format is similar to what Morfologik uses > (http://sourceforge.net/projects/morfologik/). > I think there are a number of possible places we can use this in > Lucene. For example, I think many apps could hold the entire terms > dict in RAM, either at the multi-reader level or maybe per-segment > (mapping to file offset or to something else custom to the app), which > may possibly be a good speedup for certain MTQs (though, because the > format is packed into a byte[], there is a decode cost when visiting > arcs). > The builder can also prune as it goes, so you get a prefix trie pruned > according to how many terms run through the nodes, which makes it > faster and even less memory consuming. This may be useful as a > replacement for our current binary search terms index since it can > achieve higher term density for the same RAM consumption of our > current index. > As an initial usage to make sure this is exercised, I cutover the > SimpleText codec, which currently fully loads all terms into a > TreeMap (and has caused intermittent OOME in some tests), to use an FST > instead. SimpleText uses a PairOutputs which is able to "pair up" any > two other outputs, since it needs to map each input term to an int > docFreq and long filePosition. > All tests pass w/ SimpleText forced codec, and I think this is > committable except I'd love to get some help w/ the generics > (confession to the policeman: I had to add > @SuppressWarnings({"unchecked"})) all over!! Ideally an FST is > parameterized by its output type (Integer, BytesRef, etc.). > I even added a new @nightly test that makes a largeish set of random > terms and tests the resulting FST on different outputs :) > I think it would also be easy to make a variant that uses char[] > instead of byte[] as its inputs, so we could eg use this during analysis > (Robert's idea). It's already be easy to have a CharSequence > output type since the outputs are pluggable. > Dawid Weiss (author of HPPC -- http://labs.carrotsearch.com/hppc.html -- and > Morfologik -- http://sourceforge.net/projects/morfologik/) > was very helpful iterating with me on this (thank you!). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org