Date: Wed, 1 May 2013 13:14:17 +0000 (UTC)
From: "Michael McCandless (JIRA)"
To: dev@lucene.apache.org
Subject: [jira] [Commented] (LUCENE-3842) Analyzing Suggester

    [ https://issues.apache.org/jira/browse/LUCENE-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646559#comment-13646559 ]

Michael McCandless commented on LUCENE-3842:
--------------------------------------------

bq. I'm using the wordnet synonyms, so I guess this causes a lot of paths, I suspect loops.

Ahhhh :) Yes this will cause lots of expansions / RAM used.

But NPE because paths is null sounds like a real bug. OK I see why it's happening ... when we try to enumerate all finite strings from the expanded graph, if it exceeds the limit (maxGraphExpansions), SpecialOperations.getFiniteStrings returns null but the code assumes it will return the N finite strings it had found "so far".
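The failure mode described above can be sketched in isolation. This is a minimal, self-contained illustration, not Lucene's actual code: the class `FiniteStringsSketch`, its methods, and the toy graph representation are hypothetical stand-ins for SpecialOperations.getFiniteStrings and maxGraphExpansions. It shows an enumerator that returns null once the limit is exceeded, and a caller that checks for null instead of assuming a partial result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class FiniteStringsSketch {

    // Hypothetical stand-in for a limited enumeration: returns every path from
    // `start` to a node with no outgoing edges in an acyclic graph, or null if
    // more than `limit` such paths exist. Partial results are discarded.
    public static List<String> enumerate(Map<String, List<String>> graph, String start, int limit) {
        List<String> out = new ArrayList<>();
        return walk(graph, start, start, out, limit) ? out : null;
    }

    // Depth-first walk; returns false as soon as one more path would exceed the limit.
    private static boolean walk(Map<String, List<String>> graph, String node, String path,
                                List<String> out, int limit) {
        List<String> next = graph.getOrDefault(node, List.of());
        if (next.isEmpty()) {
            if (out.size() == limit) {
                return false; // limit exceeded: the caller will see null
            }
            out.add(path);
            return true;
        }
        for (String n : next) {
            if (!walk(graph, n, path + " " + n, out, limit)) {
                return false;
            }
        }
        return true;
    }

    // A caller that treats null as "too many expansions" rather than
    // dereferencing it. Returns -1 in that case (an arbitrary choice for this sketch).
    public static int countPaths(Map<String, List<String>> graph, String start, int limit) {
        List<String> paths = enumerate(graph, start, limit);
        if (paths == null) {
            return -1;
        }
        return paths.size();
    }
}
```

With a graph where "the" leads to both "ghost" and "spirit", and each of those leads to "past", a limit of 2 yields both paths, while a limit of 1 yields null. A caller that iterates over the result without the null check would hit the NPE.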
Can you open a new issue for this? We should separately fix it. > Analyzing Suggester > ------------------- > > Key: LUCENE-3842 > URL: https://issues.apache.org/jira/browse/LUCENE-3842 > Project: Lucene - Core > Issue Type: New Feature > Components: modules/spellchecker > Affects Versions: 3.6, 4.0-ALPHA > Reporter: Robert Muir > Assignee: Michael McCandless > Fix For: 4.1, 5.0 > > Attachments: LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842.patch, LUCENE-3842-TokenStream_to_Automaton.patch > > > Since we added shortest-path wFSA search in LUCENE-3714, and generified the comparator in LUCENE-3801, > I think we should look at implementing suggesters that have more capabilities than just basic prefix matching. > In particular I think the most flexible approach is to integrate with Analyzer at both build and query time, > such that we build a wFST with: > input: analyzed text such as ghost0christmas0past <-- byte 0 here is an optional token separator > output: surface form such as "the ghost of christmas past" > weight: the weight of the suggestion > we make an FST with PairOutputs, but only do the shortest path operation on the weight side (like > the test in LUCENE-3801), at the same time accumulating the output (surface form), which will be the actual suggestion. > This allows a lot of flexibility: > * Using even standardanalyzer means you can offer suggestions that ignore stopwords, e.g. if you type in "ghost of chr...", > it will suggest "the ghost of christmas past" > * we can add support for synonyms/wdf/etc at both index and query time (there are tradeoffs here, and this is not implemented!) 
> * this is a basis for more complicated suggesters such as Japanese suggesters, where the analyzed form is in fact the reading,
>   so we would add a TokenFilter that copies ReadingAttribute into term text to support that...
> * other general things like offering suggestions that are more "fuzzy" like using a plural stemmer or ignoring accents or whatever.
> According to my benchmarks, suggestions are still very fast with the prototype (e.g. ~ 100,000 QPS), and the FST size does not
> explode (it's short of twice that of a regular wFST, but this is still far smaller than TST or JaSpell, etc).
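
As a rough illustration of the input/output/weight triple described in the issue, the sketch below swaps the wFST for a plain sorted map, so it is not the patch's implementation and says nothing about its speed or size. The `analyze` method is a hypothetical stand-in for an Analyzer (lowercasing plus a tiny hard-coded stopword list), all class and method names are invented for the example, and ranking higher weights first is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class AnalyzedSuggestSketch {
    // Token separator between analyzed terms (the "byte 0" in the description).
    private static final char SEP = '\u0000';
    // Hypothetical stopword list standing in for a real Analyzer's configuration.
    private static final Set<String> STOPWORDS = Set.of("the", "of", "a");

    // The "output" and "weight" sides of one suggestion.
    public static final class Suggestion {
        public final String surface;
        public final long weight;

        Suggestion(String surface, long weight) {
            this.surface = surface;
            this.weight = weight;
        }
    }

    // Sorted map keyed by analyzed form (the "input" side).
    private final TreeMap<String, List<Suggestion>> byAnalyzed = new TreeMap<>();

    // Toy analysis: lowercase, split on whitespace, drop stopwords, join with SEP.
    public static String analyze(String text) {
        StringBuilder sb = new StringBuilder();
        for (String tok : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (tok.isEmpty() || STOPWORDS.contains(tok)) {
                continue;
            }
            if (sb.length() > 0) {
                sb.append(SEP);
            }
            sb.append(tok);
        }
        return sb.toString();
    }

    // Build time: index the surface form under its analyzed form.
    public void add(String surface, long weight) {
        byAnalyzed.computeIfAbsent(analyze(surface), k -> new ArrayList<>())
                  .add(new Suggestion(surface, weight));
    }

    // Query time: analyze the query, prefix-match on analyzed forms,
    // and return up to topN surface forms, highest weight first.
    public List<String> suggest(String query, int topN) {
        String prefix = analyze(query);
        List<Suggestion> hits = new ArrayList<>();
        // Keys sharing a prefix are contiguous in a sorted map, so scan from the
        // first key >= prefix and stop at the first key that no longer matches.
        for (Map.Entry<String, List<Suggestion>> e : byAnalyzed.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        hits.sort(Comparator.comparingLong((Suggestion s) -> s.weight).reversed());
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, hits.size()); i++) {
            out.add(hits.get(i).surface);
        }
        return out;
    }
}
```

After adding "the ghost of christmas past", a query of "ghost of chr" analyzes to the two terms ghost and chr joined by the separator, which is a prefix of the indexed analyzed form, so the full surface form comes back even though the query omitted the leading stopword.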