From: Bob Carpenter
Date: Fri, 09 Jun 2006 11:09:00 -0400
To: java-dev@lucene.apache.org
Subject: Re: Edit-distance strategy (slicing and one vs. all algorithms)

> Can you share how you implemented the trie in your app? The especially
> interesting part for me is how you go about memory consumption. Have you
> tried really large dictionaries (1M+)?

Sure can. You could slog through the source, which is available from:

http://www.alias-i.com/lingpipe

There are several trie implementations.

The approximate dictionary trie is implemented very crudely. We've run it on 1M gene names derived from EntrezGene (anywhere from 2 to 60 characters long). I don't know the speed numbers offhand, but nothing to write home about. The real problem is that it too easily matched short terms to short terms within a given edit distance. So a gene name AT was matching GT or IT or whatever at edit distance 1, whereas alpha-ACT wasn't matching AACT like we wanted. We still haven't solved this problem; we went back to using the method we described in our case study in Lucene in Action.

The spelling checker trie for known tokens is also very crude, as it's not a bottleneck in that processing (not even 0.1% of the profiled time). We've scaled that to several million entries in a large customer application. The language model tries were a bottleneck, though.

The trie underlying our language models is implemented pretty space-efficiently in memory. It only stores counts at each node, not general records or even terminal/non-terminal information as you'd need in a word list.
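As a rough illustration only (the class below is made up, and it shows the naive map-per-node layout rather than LingPipe's packed one), a count-only character trie in Java looks something like this:

import java.util.HashMap;
import java.util.Map;

// Toy count-only character trie: each node stores just the count of the
// character sequence that leads to it, nothing else.
public class CountTrie {
    private long count;
    private final Map<Character, CountTrie> daughters = new HashMap<>();

    // Bump the count of every prefix of the sequence, which is roughly what
    // training a character language model does with each observed n-gram.
    public void increment(CharSequence cs) {
        CountTrie node = this;
        node.count++;
        for (int i = 0; i < cs.length(); i++) {
            node = node.daughters.computeIfAbsent(cs.charAt(i), c -> new CountTrie());
            node.count++;
        }
    }

    // Count of the given sequence, or 0 if it was never seen.
    public long countOf(CharSequence cs) {
        CountTrie node = this;
        for (int i = 0; i < cs.length(); i++) {
            node = node.daughters.get(cs.charAt(i));
            if (node == null) {
                return 0L;
            }
        }
        return node.count;
    }
}

The next paragraph describes how the real implementation avoids allocating a map and a full object per node.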
The language model trie codes its nodes to an interface, with PAT (Patricia trie) implementations, shared terminal-node implementations, and so on, with unfolding all the way down to individual elements so that nodes with one to three daughters don't allocate arrays, counts below 256 are represented with bytes, etc. A factory sorts out all the object creation and sharing. I wouldn't normally be so aggressive about optimizing this kind of thing, but memory and speed were a real problem with the naive implementation, and this sits in most of our inner loops for model training.

The tries can be compiled into a form where they can easily be written to disk and read back in for efficient run-time use in language modeling (estimating the probability of a string given a topic, language, sentiment, etc.). That's the critical step in most of our run-time operations.

Anyway, there's a paper on the scalable LM tries that covers everything in pretty excruciating detail (even more than this message):

http://www.colloquial.com/carp/Publications/acl05soft-carpenter.pdf

In LingPipe 2.3, which should be out next week, there are methods to write character tries to disk and to merge on-disk tries with and without pruning. This uses some of the same bit-level coding techniques (gamma coding, specifically) as inverted indexes in search engines, and yields a very tight memory representation with good merging properties. (A toy illustration of gamma coding appears below the quoted message.) I've basically adopted Lucene's strategy: on-disk representations write out to a mergeable streaming format that can scale on disk (or in memory).

- Bob Carpenter

eks dev wrote:
> Hi Bob,
>
> really nice answer!
>
>> The real gain would be to do something like the
>> edit-distance generalization of Aho-Corasick. The
>> basic idea is that instead of n iterations of string vs. string,
>> you do one iteration of string vs. trie.
>
> I was experimenting a bit with a ternary trie, as it has some nice
> properties, e.g. being 40% faster than the standard Java or Trove
> HashMap for exact matches, but I never got to finish path compression
> and null-node reduction (that's how one saves a huge amount of memory).
> Must do it one day.
>
> thanks!
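Here is the promised toy illustration of gamma coding. It works over a string of '0'/'1' characters purely to show the bit layout; it is not LingPipe's or Lucene's actual I/O code, and the class and method names are made up. A positive integer n is written as floor(log2 n) zeros followed by n in binary, so small counts get very short codes.

public class Gamma {

    // Gamma-code a positive integer: (bit length - 1) zeros, then n in binary.
    static String encode(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("gamma codes positive integers only");
        }
        String binary = Long.toBinaryString(n);   // always starts with '1'
        StringBuilder bits = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) {
            bits.append('0');                     // unary length prefix
        }
        return bits.append(binary).toString();
    }

    // Decode a single gamma code starting at position pos.
    static long decode(String bits, int pos) {
        int zeros = 0;
        while (bits.charAt(pos + zeros) == '0') {
            zeros++;                              // recover the length prefix
        }
        return Long.parseLong(bits.substring(pos + zeros, pos + 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        // 1 -> "1", 2 -> "010", 5 -> "00101", 13 -> "0001101"
        for (long n : new long[] {1, 2, 5, 13}) {
            System.out.println(n + " -> " + encode(n) + " -> " + decode(encode(n), 0));
        }
    }
}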
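For what it's worth, here is a rough sketch of the string-vs-trie idea quoted above: walk the trie once, carrying one row of the standard edit-distance table per node, and collect every entry whose final cell is within the bound. This is only a naive illustration with made-up names, not the Aho-Corasick-style generalization and not LingPipe's approximate dictionary.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Naive string-vs-trie edit-distance search: one pass over the trie instead
// of one string-vs-string comparison per dictionary entry.
public class ApproxDictionary {

    private static class Node {
        final Map<Character, Node> kids = new HashMap<>();
        String word;   // non-null iff a dictionary entry ends at this node
    }

    private final Node root = new Node();

    public void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.kids.computeIfAbsent(c, k -> new Node());
        }
        node.word = word;
    }

    // All dictionary entries within maxDist edits of the query.
    public List<String> search(String query, int maxDist) {
        List<String> hits = new ArrayList<>();
        int[] row = new int[query.length() + 1];
        for (int j = 0; j < row.length; j++) {
            row[j] = j;   // distance from the empty trie prefix
        }
        for (Map.Entry<Character, Node> e : root.kids.entrySet()) {
            search(e.getValue(), e.getKey(), query, row, maxDist, hits);
        }
        return hits;
    }

    private void search(Node node, char c, String query, int[] prevRow,
                        int maxDist, List<String> hits) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;   // trie prefix grew by one character
        int best = row[0];
        for (int j = 1; j < cols; j++) {
            int subst = prevRow[j - 1] + (query.charAt(j - 1) == c ? 0 : 1);
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), subst);
            best = Math.min(best, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) {
            hits.add(node.word);
        }
        if (best <= maxDist) {   // prune branches that can't get back under the bound
            for (Map.Entry<Character, Node> e : node.kids.entrySet()) {
                search(e.getValue(), e.getKey(), query, row, maxDist, hits);
            }
        }
    }
}

With maxDist set to 1, a query like AT matches dictionary entries GT or IT, which is exactly the short-term problem described earlier in the message.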