Reply-To: dev@lucene.apache.org
Message-ID: <30229613.8511274704596189.JavaMail.jira@thor>
Date: Mon, 24 May 2010 08:36:36 -0400 (EDT)
From: "Michael McCandless (JIRA)" <jira@apache.org>
To: dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-1622) Multi-word synonym filter (synonym
 expansion at indexing time).


    [ https://issues.apache.org/jira/browse/LUCENE-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870588#action_12870588 ] 

Michael McCandless commented on LUCENE-1622:
--------------------------------------------

Here's the dev thread that led to this issue, for context:

  http://www.lucidimagination.com/search/document/fde6d4b979481398/synonym_filter_with_support_for_phrases

I think the syn filter here takes generally the same approach as
Solr's SynonymFilter (now moved to modules/analyzer in trunk), ie it
emits overlapping words as the expanded synonyms unwind?  Are there
salient differences between the two?  Maybe we can merge them and get
the best of both worlds?

There are tricky tradeoffs of index time vs search time -- index time
is less flexible (you must re-index whenever the synonyms change) but
gives better search perf (you OR in a single TermQuery instead of
expanding to many PhraseQuerys) and better scoring (the IDF is "true"
if the syn is a term in the index, vs PhraseQuery which necessarily
approximates, possibly badly).

There is also the controversial question of whether using manually
defined synonyms even helps relevance :) As Robert points out, doing
an iteration of feedback (take the top N docs that match the user's
query, extract their salient terms, and do a 2nd search expanded w/
those salient terms) sort of accomplishes something similar (and
perhaps better since it's not just synonyms but also uncovers
"relationships" like Barack Obama is a US president), but w/o the
manual effort of creating the synonyms.  And it's been shown to
improve relevance.
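
To make the feedback idea concrete, here is a rough sketch in plain
Python (not Lucene code).  Everything in it is an assumption made up
for illustration: the toy corpus, the term-count scoring, the tf*idf
weighting for "salient" terms, and the names search, salient_terms and
feedback_search.

```python
import math
from collections import Counter

def search(docs, query_terms):
    """Rank doc ids by total occurrences of the query terms (toy scoring)."""
    scores = {}
    for doc_id, terms in docs.items():
        score = sum(terms.count(t) for t in query_terms)
        if score:
            scores[doc_id] = score
    return sorted(scores, key=lambda d: (-scores[d], d))

def salient_terms(docs, doc_ids, exclude, k):
    """Top-k terms of the feedback docs by tf * idf, skipping the query's own terms."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    tf = Counter(t for d in doc_ids for t in docs[d])
    weights = {t: tf[t] * math.log(n / df[t]) for t in tf if t not in exclude}
    return sorted(weights, key=lambda t: (-weights[t], t))[:k]

def feedback_search(docs, query_terms, top_n=2, k=1):
    """First pass, pick salient terms from the top N docs, then search again."""
    first_pass = search(docs, query_terms)[:top_n]
    expansion = salient_terms(docs, first_pass, set(query_terms), k)
    return search(docs, list(query_terms) + expansion), expansion

# Hypothetical corpus, already tokenized.
docs = {
    "d1": "obama president president speech".split(),
    "d2": "obama president election".split(),
    "d3": "president visits boston".split(),
    "d4": "great food place in boston".split(),
    "d5": "boston weather report".split(),
    "d6": "food place reviews".split(),
}
results, expansion = feedback_search(docs, ["obama"], top_n=2, k=1)
```

Here the first pass for "obama" finds d1 and d2, the expansion step
picks up "president" from them, and the second pass also returns d3,
which never mentions "obama".  No synonym list was written by hand.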

Still, I think Lucene should make both index-time and query-time
expansion feasible.  At the plumbing level we don't have a horse in
that race :)

If you do index syns at index time, you really should just inject a
single syn token, representing any occurrence of a term/phrase that
this synonym accepts (and do the same mapping at query time).  But,
then, as Earwin pointed out, Lucene is missing the notion of "span"
saying how many positions this term took up (we only encode the pos
incr, reflecting where this token begins relative to the last token's
beginning).

EG if "food place" is a syn for "restaurant", and you have a doc
"... a great food place in boston ...", and so you inject RESTAURANT (syn
group) "over" the phrase "food place", then an exact phrase query
won't work right -- you can't have "a great RESTAURANT in boston"
match.
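
The failure is easy to reproduce with a toy positional index.  This is
a sketch in plain Python, not Lucene code; index_positions and
phrase_match are made-up helpers that assume positions come only from
summed position increments and that a phrase matches when term i sits
at start + i.

```python
def index_positions(tokens):
    """tokens: list of (term, pos_incr). Returns {term: set of positions}."""
    postings = {}
    pos = -1
    for term, pos_incr in tokens:
        pos += pos_incr
        postings.setdefault(term, set()).add(pos)
    return postings

def phrase_match(postings, phrase):
    """Exact phrase match: the i-th phrase term must occur at start + i."""
    starts = postings.get(phrase[0], set())
    return any(
        all(start + i in postings.get(term, set()) for i, term in enumerate(phrase))
        for start in starts
    )

doc = [
    ("a", 1), ("great", 1),
    ("food", 1), ("RESTAURANT", 0),  # syn token injected at the position of "food"
    ("place", 1), ("in", 1), ("boston", 1),
]
postings = index_positions(doc)
original_ok = phrase_match(postings, ["a", "great", "food", "place", "in", "boston"])
synonym_ok = phrase_match(postings, ["a", "great", "RESTAURANT", "in", "boston"])
```

RESTAURANT lands on position 2 but "in" is at position 4, so
synonym_ok is False while original_ok is True: nothing in the index
says that RESTAURANT covers two positions.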

One simple way to express this during analysis is as a new SpanAttr
(say), which expresses how many positions the token takes up.  We
could then index this, doing so efficiently for the default case
(span==1), and then in addition to getting the .nextPosition() you
could then also ask for .span() from DocsAndPositionsEnum.
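
As a sketch of what a span buys us (plain Python again, not the real
Lucene API; the span value, index_with_span and phrase_match_with_span
are hypothetical), store (position, span) per occurrence and require
each phrase term to start where the previous one ended:

```python
def index_with_span(tokens):
    """tokens: list of (term, pos_incr, span). Returns {term: [(position, span)]}."""
    postings = {}
    pos = -1
    for term, pos_incr, span in tokens:
        pos += pos_incr
        postings.setdefault(term, []).append((pos, span))
    return postings

def phrase_match_with_span(postings, phrase):
    """Each term must start at the end (start + span) of the previous term."""
    def match_from(i, pos):
        if i == len(phrase):
            return True
        return any(
            match_from(i + 1, start + span)
            for start, span in postings.get(phrase[i], [])
            if start == pos
        )
    return any(
        match_from(1, start + span) for start, span in postings.get(phrase[0], [])
    )

doc = [
    ("a", 1, 1), ("great", 1, 1),
    ("food", 1, 1), ("RESTAURANT", 0, 2),  # syn token spans "food place"
    ("place", 1, 1), ("in", 1, 1), ("boston", 1, 1),
]
postings = index_with_span(doc)
synonym_ok = phrase_match_with_span(postings, ["a", "great", "RESTAURANT", "in", "boston"])
original_ok = phrase_match_with_span(postings, ["a", "great", "food", "place", "in", "boston"])
partial_ok = phrase_match_with_span(postings, ["great", "food", "in"])
```

With the span recorded, both the RESTAURANT phrase and the original
phrase match, while "great food in" (which skips "place") still does
not.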

But, generalizing this a bit, really we are indexing a graph, where
the nodes are positions and the edges are tokens connecting them.
With only posIncr & span, you restrict the nodes to be a single linear
chain; but if we generalize it, then nodes can be part of side
branches; eg the node in the middle of "food place" need not be a
"real" position if it were injected into a document / query containing
restaurant.  Hard boundaries (eg b/w sentences) would be more cleanly
represented here -- there would not even be an edge between the nodes.

We'd then need an AutomatonWordQuery -- the same idea as
AutomatonQuery, except at the word level not at the character level.
MultiPhraseQuery would then be a special case of AutomatonWordQuery.
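
A rough sketch of the idea, in plain Python rather than against any
Lucene API: the document is a list of (from node, to node, term)
edges, the query is a word-level automaton, and matching walks both in
lockstep.  The names build_graph, multi_phrase_automaton and matches
are invented here, and the linear automaton only models the
MultiPhraseQuery-like special case.

```python
def build_graph(edges):
    """edges: list of (src, dst, term). Returns {src: [(term, dst)]}."""
    graph = {}
    for src, dst, term in edges:
        graph.setdefault(src, []).append((term, dst))
    return graph

def multi_phrase_automaton(steps):
    """Linear automaton: state i --(any term in steps[i])--> state i + 1."""
    transitions = {}
    for i, terms in enumerate(steps):
        for term in terms:
            transitions[(i, term)] = i + 1
    return transitions, 0, {len(steps)}

def matches(graph, automaton):
    """True if some path through the token graph is accepted by the automaton."""
    transitions, start, accepting = automaton
    nodes = set(graph) | {dst for outs in graph.values() for _, dst in outs}
    stack = [(node, start) for node in nodes]  # a match may begin at any node
    seen = set(stack)
    while stack:
        node, state = stack.pop()
        if state in accepting:
            return True
        for term, dst in graph.get(node, ()):
            nxt = transitions.get((state, term))
            if nxt is not None and (dst, nxt) not in seen:
                seen.add((dst, nxt))
                stack.append((dst, nxt))
    return False

graph = build_graph([
    (0, 1, "a"), (1, 2, "great"),
    (2, 3, "food"), (3, 4, "place"),
    (2, 4, "RESTAURANT"),            # syn edge bypassing the middle node 3
    (4, 5, "in"), (5, 6, "boston"),
    # hard sentence boundary: no edge from node 6 to node 7
    (7, 8, "we"), (8, 9, "ate"),
])
syn_phrase = matches(graph, multi_phrase_automaton([{"great"}, {"RESTAURANT"}, {"in"}]))
alternatives = matches(graph, multi_phrase_automaton([{"food", "RESTAURANT"}, {"place", "in"}]))
skips_place = matches(graph, multi_phrase_automaton([{"great"}, {"food"}, {"in"}]))
crosses_boundary = matches(graph, multi_phrase_automaton([{"boston"}, {"we"}]))
```

The phrase through the RESTAURANT edge matches, the one that skips
"place" does not, and nothing can match across the sentence boundary
because there is simply no edge to follow.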

Then analysis becomes the serializing of this graph... analysis would
have to flatten out the nodes into a single linear chain, and then
express the edges using position & span.  I think position would no
longer be a hard relative position.  EG when injecting "food place" (=
2 tokens) into the tokens that contain restaurant, both food and
restaurant would have the same start position, but food would have
span 1 and restaurant would have span 2.
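
One possible way to do that flattening, sketched in plain Python under
my own assumptions (a topological order of the nodes gives the linear
chain; flatten and the node names are made up for this example):

```python
from graphlib import TopologicalSorter

def flatten(edges):
    """edges: list of (src, dst, term). Returns sorted (position, span, term) triples."""
    preds = {}
    for src, dst, _ in edges:
        preds.setdefault(src, set())
        preds.setdefault(dst, set()).add(src)
    # Linearize the nodes; every edge then points forward in the chain.
    order = list(TopologicalSorter(preds).static_order())
    index = {node: i for i, node in enumerate(order)}
    return sorted(
        (index[src], index[dst] - index[src], term) for src, dst, term in edges
    )

# "a great restaurant in boston" with "food place" injected as a side branch.
edges = [
    ("n0", "n1", "a"), ("n1", "n2", "great"),
    ("n2", "n3", "restaurant"),
    ("n2", "mid", "food"), ("mid", "n3", "place"),  # "mid" is not a "real" position
    ("n3", "n4", "in"), ("n4", "n5", "boston"),
]
flat = flatten(edges)
```

The side-branch node gets its own slot in the chain, so food and
restaurant both start at position 2, food with span 1 and restaurant
with span 2, and "in" moves out to position 4.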

(Sorry for the rambling... this is a complex topic!!).


> Multi-word synonym filter (synonym expansion at indexing time).
> ---------------------------------------------------------------
>
>                 Key: LUCENE-1622
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1622
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/analyzers
>            Reporter: Dawid Weiss
>            Priority: Minor
>         Attachments: synonyms.patch
>
>
> It would be useful to have a filter that provides support for indexing-time synonym expansion, especially for multi-word synonyms (with multi-word matching for original tokens).
> The problem is not trivial, as observed on the mailing list. The problems I was able to identify (mentioned in the unit tests as well):
> - if multi-word synonyms are indexed together with the original token stream (at overlapping positions), then a query for a partial synonym sequence (e.g., "big" in the synonym "big apple" for "new york city") causes the document to match;
> - there are problems with highlighting the original document when a synonym is matched (see unit tests for an example);
> - if the synonym is of different length than the original sequence of tokens to be matched, then phrase queries spanning the synonym and the original sequence boundary won't be found. Example: "big apple" as a synonym for "new york city". A phrase query "big apple restaurants" won't match "new york city restaurants".
> I am posting the patch that implements phrase synonyms as a token filter. This is not necessarily intended for immediate inclusion, but may provide a basis for many people to experiment and adjust to their own scenarios.
