From: "Robert Muir (Commented) (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Date: Tue, 14 Feb 2012 16:02:00 +0000 (UTC)
Subject: [jira] [Commented] (LUCENE-3714) add suggester that uses shortest path/wFST instead of buckets
Message-ID: <513878923.36831.1329235320608.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <368047245.63752.1327173220043.JavaMail.tomcat@hel.zones.apache.org>
Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm

[ 
https://issues.apache.org/jira/browse/LUCENE-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207798#comment-13207798 ]

Robert Muir commented on LUCENE-3714:
-------------------------------------

I played with a lot of variations on this patch:
* generified the shortest path method to take things like Comparators/Comparable
* tried different algebra/representations

With that, I think we should keep the code wired to Long for now. According to the benchmark, any generification costs roughly 5-10% in overall performance, and I don't see a need for anything but Long.

As far as representation goes, I think we need to integrate the offline sort, find the min/max float values, and scale to the available space, e.g. if precision is 32 then Integer.MAX_VALUE - scaledWeight.

I tried different representations and they just add more complexity (e.g. negative outputs) without saving much space at all. This patch uses Integer precision and is only 10% larger than the previous impl.

We don't even need precision to be configurable really; we could wire it to integers as a start. But maybe later someone could specify it, e.g. if they specified 8 then they would basically get the same result as the bucketed algorithm today...

> add suggester that uses shortest path/wFST instead of buckets
> -------------------------------------------------------------
>
>                 Key: LUCENE-3714
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3714
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/spellchecker
>            Reporter: Robert Muir
>         Attachments: LUCENE-3714.patch, LUCENE-3714.patch, LUCENE-3714.patch, LUCENE-3714.patch, LUCENE-3714.patch, TestMe.java, out.png
>
>
> Currently the FST suggester (really an FSA) quantizes weights into buckets (e.g. single byte) and puts them in front of the word.
> This makes it fast, but you lose granularity in your suggestions.
> Lately the question was raised: if you build Lucene's FST with PositiveIntOutputs, does it behave the same as a tropical semiring wFST?
> In other words, after completing the word, we instead traverse min(output) at each node to find the 'shortest path' to the
> best suggestion (with the highest score).
> This means we wouldn't need to quantize weights at all, and it might make some operations (e.g. adding fuzzy matching etc) a lot easier.
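
The weight encoding described in the comment (find the min/max float values, scale to the integer range, then store Integer.MAX_VALUE - scaledWeight) could be sketched roughly as below. This is only an illustration under stated assumptions, not code from the LUCENE-3714 patch: the class WeightEncoder and its methods are hypothetical names, and linear scaling with rounding is an assumed choice that the comment does not specify.

```java
/**
 * Hypothetical helper for illustration only; not part of the LUCENE-3714 patch.
 * Turns float weights into non-negative integer costs so that the highest
 * weight gets the lowest cost, which suits a min(output) shortest-path search.
 */
final class WeightEncoder {
    private WeightEncoder() {}

    /** Linearly scales weight from [min, max] to [0, Integer.MAX_VALUE] (assumed scheme). */
    static int scale(float weight, float min, float max) {
        if (max <= min) {
            return 0; // degenerate case: all weights are equal
        }
        double fraction = ((double) weight - min) / ((double) max - min);
        fraction = Math.max(0.0, Math.min(1.0, fraction)); // clamp out-of-range input
        return (int) Math.round(fraction * Integer.MAX_VALUE);
    }

    /** Inverts the scaled weight so that a higher weight means a lower cost. */
    static int encode(float weight, float min, float max) {
        return Integer.MAX_VALUE - scale(weight, min, max);
    }

    /** Finds min/max in one pass, then encodes every weight as a cost. */
    static int[] encodeAll(float[] weights) {
        float min = Float.POSITIVE_INFINITY;
        float max = Float.NEGATIVE_INFINITY;
        for (float w : weights) {
            min = Math.min(min, w);
            max = Math.max(max, w);
        }
        int[] costs = new int[weights.length];
        for (int i = 0; i < weights.length; i++) {
            costs[i] = encode(weights[i], min, max);
        }
        return costs;
    }
}
```

With weights {1, 5, 10}, WeightEncoder.encodeAll gives the highest weight a cost of 0 and the lowest a cost of Integer.MAX_VALUE, so following the minimum output reaches the highest-scoring suggestion first.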