Message-ID: <30229613.8511274704596189.JavaMail.jira@thor>
Date: Mon, 24 May 2010 08:36:36 -0400 (EDT)
From: "Michael McCandless (JIRA)"
To: dev@lucene.apache.org
Reply-To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym expansion at indexing time).

    [ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870588#action_12870588 ]

Michael McCandless commented on LUCENE-1622:
--------------------------------------------

Here's the dev thread that led to this issue, for context:

    http://www.lucidimagination.com/search/document/fde6d4b979481398/synonym_filter_with_support_for_phrases

I think the syn filter here takes generally the same approach as Solr's (now moved to modules/analyzer in trunk) SynonymFilter, ie overlapping words as the expanded synonyms unwind? Are there salient differences between the two? Maybe we can merge them and get the best of both worlds?

There are tricky tradeoffs of index time vs search time -- index time is less flexible (you must re-index on changing the synonyms) but gives better search perf (OR in a TermQuery instead of expanding to many PhraseQuerys); index time also gives better scoring (the IDF is "true" if the syn is a term in the index, vs PhraseQuery which necessarily approximates, possibly badly).

There is also the controversial question of whether using manually defined synonyms even helps relevance :) As Robert points out, doing an iteration of feedback (take the top N docs that match the user's query, extract their salient terms, and do a 2nd search expanded w/ those salient terms) sort of accomplishes something similar (and perhaps better, since it's not just synonyms but also uncovers "relationships" like Barack Obama is a US president), but w/o the manual effort of creating the synonyms. And it's been shown to improve relevance.

Still, I think Lucene should make index and query time expansion feasible.
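To make the index-time vs search-time tradeoff concrete, here is a rough toy sketch in plain Python. It is not Lucene code: SYNONYMS, DOCS, build_index, search_index_time and search_query_time are all made-up names for illustration, and the "query time" path simply scans the documents as a stand-in for running one phrase query per surface form.

```python
from collections import defaultdict

# Hypothetical synonym group: every surface form below means RESTAURANT.
SYNONYMS = {"RESTAURANT": [("restaurant",), ("food", "place"), ("eatery",)]}

# Hypothetical toy corpus.
DOCS = {
    1: "a great food place in boston",
    2: "a new restaurant downtown",
    3: "a quiet library",
}


def contains_phrase(tokens, phrase):
    """True if the tuple `phrase` occurs contiguously in the list `tokens`."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == phrase
               for i in range(len(tokens) - n + 1))


def build_index(docs, expand):
    """Map term -> set of doc ids; if `expand`, also inject one token per syn group."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.split()
        for tok in tokens:
            index[tok].add(doc_id)
        if expand:
            for group, forms in SYNONYMS.items():
                if any(contains_phrase(tokens, form) for form in forms):
                    index[group].add(doc_id)
    return index


def search_index_time(index, group):
    """Index-time expansion: the query is a single term lookup."""
    return set(index.get(group, ()))


def search_query_time(docs, group):
    """Query-time expansion: one phrase check per surface form, per document."""
    return {doc_id for doc_id, text in docs.items()
            if any(contains_phrase(text.split(), form)
                   for form in SYNONYMS[group])}
```

Both search_index_time(build_index(DOCS, True), "RESTAURANT") and search_query_time(DOCS, "RESTAURANT") find docs 1 and 2. The index-time path does one lookup but needs a rebuild whenever SYNONYMS changes; the query-time path needs no rebuild but does work proportional to the number of surface forms.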
At the plumbing level we don't have a horse in that race :)

If you do index syns at index time, you really should just inject a single syn token, representing any occurrence of a term/phrase that this synonym accepts (and do the matching thing @ query time). But then, as Earwin pointed out, Lucene is missing the notion of "span" saying how many positions this term took up (we only encode the pos incr, reflecting where this token begins relative to the last token's beginning).

EG if "food place" is a syn for "restaurant", and you have a doc "... a great food place in boston ...", and so you inject RESTAURANT (syn group) "over" the phrase "food place", then an exact phrase query won't work right -- you can't have "a great RESTAURANT in boston" match.

One simple way to express this during analysis is as a new SpanAttr (say), which expresses how many positions the token takes up. We could then index this, doing so efficiently for the default case (span==1), and then in addition to getting the .nextPosition() you could then also ask for .span() from DocsAndPositionsEnum.

But, generalizing this a bit, really we are indexing a graph, where the nodes are positions and the edges are tokens connecting them. With only posIncr & span, you restrict the nodes to be a single linear chain; but if we generalize it, then nodes can be part of side branches; eg the node in the middle of "food place" need not be a "real" position if it were injected into a document / query containing restaurant. Hard boundaries (eg b/w sentences) would be more cleanly represented here -- there would not even be an edge between the nodes.

We'd then need an AutomatonWordQuery -- the same idea as AutomatonQuery, except at the word level not at the character level. MultiPhraseQuery would then be a special case of AutomatonWordQuery.

Then analysis becomes the serializing of this graph...
analysis would have to flatten out the nodes into a single linear chain, and then express the edges using position & span. I think position would no longer be a hard relative position. EG when injecting "food place" (= 2 tokens) into the tokens that contain restaurant, both food and restaurant would have the same start position, but food would have span 1 and restaurant would have span 2.

(Sorry for the rambling... this is a complex topic!!)

> Multi-word synonym filter (synonym expansion at indexing time).
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1622
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., "big" in the synonym "big apple" for "new york city") causes the document to match;
> - there are problems with highlighting the original document when synonym is matched (see unit tests for an example),
> - if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won't be found. Example "big apple" synonym for "new york city". A phrase query "big apple restaurants" won't match "new york city restaurants".
> I am posting the patch that implements phrase synonyms as a token filter.
> This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.
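
The position + span idea from the comment above can be sketched as a toy model in plain Python. This is only an illustration under stated assumptions, not Lucene's API: Token, analyze and phrase_matches are hypothetical names, each token is treated as an edge running from node pos to node pos + span, and a phrase query matches when its terms trace a connected path through those edges.

```python
from collections import namedtuple

# Hypothetical token model: an edge from node `pos` to node `pos + span`.
Token = namedtuple("Token", ["term", "pos", "span"])


def analyze(words, synonyms, use_span=True):
    """Emit one token per word, plus one injected token per matched synonym phrase.

    `synonyms` maps a tuple of words to a syn group name. With use_span=False
    the injected token always gets span 1, i.e. only a start position is kept.
    """
    tokens = [Token(word, i, 1) for i, word in enumerate(words)]
    for phrase, group in synonyms.items():
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                tokens.append(Token(group, i, n if use_span else 1))
    return tokens


def phrase_matches(tokens, query):
    """True if the query terms form a connected path through the token graph."""
    # (start node, term) -> set of end nodes reachable over that edge
    edges = {}
    for tok in tokens:
        edges.setdefault((tok.pos, tok.term), set()).add(tok.pos + tok.span)
    frontier = {tok.pos for tok in tokens}  # a phrase may start at any node
    for term in query:
        frontier = {end for node in frontier
                    for end in edges.get((node, term), ())}
        if not frontier:
            return False
    return True


WORDS = "a great food place in boston".split()
SYNS = {("food", "place"): "RESTAURANT"}
QUERY = "a great RESTAURANT in boston".split()
```

With spans recorded, phrase_matches(analyze(WORDS, SYNS), QUERY) succeeds because RESTAURANT runs from node 2 to node 4, where "in" begins. With use_span=False the injected token ends at node 3, where only "place" begins, so the same phrase query fails, which is the problem described for "a great RESTAURANT in boston".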