lucene-dev mailing list archives

From "Jack Krupansky (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-4381) Query-time multi-word synonym expansion
Date Wed, 30 Jan 2013 15:23:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566539#comment-13566539 ]

Jack Krupansky commented on SOLR-4381:
--------------------------------------

I have personally implemented multi-word synonym support within a query parser, bypassing
analysis for synonym processing as you suggest, but still examining the analysis chain to
discover and load the field-specific synonym table. Yes, that approach can work, but I have
refrained from proposing such a solution in Solr/Lucene since it is rather messy and not really
an ideal solution because it does bypass analysis. There are ongoing discussions on the Lucene/Solr
lists about how best to address query-time synonym processing; there have actually been some
hopeful suggestions recently, but still a long way to go. I would rather see those discussions
continue and come to fruition than see edismax changed in a way that would be incompatible
with a more ideal solution.
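
To make the "load the field-specific synonym table" step concrete, here is a minimal sketch of reading rules in the style of a Solr synonyms file (comma-separated equivalents, or an explicit `=>` mapping). It is an illustration only, not the code from the comment or the patch: the function name `parse_synonyms` and the sample rules are assumptions, and escaping and the `expand` option are ignored.

```python
from collections import defaultdict


def parse_synonyms(text):
    """Parse synonym rules into {phrase: set of replacement phrases}.

    Hypothetical helper for illustration; handles only two rule shapes:
      a, b, c     -> every phrase maps to the whole group
      a, b => c   -> each left-hand phrase maps only to the right-hand side
    """
    table = defaultdict(set)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            lhs, rhs = line.split("=>", 1)
            sources = [s.strip().lower() for s in lhs.split(",") if s.strip()]
            targets = [t.strip().lower() for t in rhs.split(",") if t.strip()]
            for source in sources:
                table[source].update(targets)
        else:
            group = [t.strip().lower() for t in line.split(",") if t.strip()]
            for term in group:
                table[term].update(group)
    return dict(table)


# Sample rules (made up for this example).
RULES = """
# pets
dog, hound, canis familiaris
tv => television
"""
SYNONYMS = parse_synonyms(RULES)
```

Multi-word entries such as "canis familiaris" are kept as whole phrases here, which is the part that a token-by-token analysis chain makes awkward at query time.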

I suppose you could simply have your patch remain a patch forever without integration into
the Solr code base, for people who are desperate to have the feature in edismax, but due to
its far-from-ideal nature (bypassing analysis and not supporting field-specific synonym tables),
it would seem less likely to be integrated into the Solr code base since it would interfere
with a broader solution. Note that I am NOT a committer, so I would have no official say in
the matter. This is just my own opinion.

I suppose you could also package it as a separate "contrib" query parser and then it could
be integrated into a Solr release and be available to anybody without the need for patching.
That might be the more fruitful approach for near-term integration.

But I would definitely be -1 for direct integration into edismax since it does bypass analysis
(and, as an incidental objection, doesn't support field-specific synonym tables). Analysis is
really important and gives the developer fine-tuning control over field-specific processing
without changing any code.

OTOH, if it could be turned on and off dynamically with a request parameter, maybe direct
integration into the Solr code base would be feasible. IOW, if it is simply a user-selectable
"plugin", that would be more compelling.

Again, I am not a committer, so my opinion here can be freely ignored.
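
As a rough illustration of the request-parameter toggle suggested above, the sketch below builds a Solr select URL with the synonym behavior switched on or off per request. The parser name `synonym_edismax`, the `synonyms` parameter, and the helper `build_select_url` are assumptions made for this example; they are not confirmed names from the patch.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, use_synonyms=False):
    """Build a /select URL; synonym expansion is opt-in per request."""
    params = {"q": query, "defType": "edismax"}
    if use_synonyms:
        # Hypothetical parser name and flag, used only for illustration.
        params["defType"] = "synonym_edismax"
        params["synonyms"] = "true"
    return base_url + "/select?" + urlencode(params)


plain_url = build_select_url("http://localhost:8983/solr", "dog food")
expanded_url = build_select_url("http://localhost:8983/solr", "dog food", use_synonyms=True)
```

With the flag off, the request is an ordinary edismax query, so existing behavior is untouched unless a caller asks for expansion.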

                
> Query-time multi-word synonym expansion
> ---------------------------------------
>
>                 Key: SOLR-4381
>                 URL: https://issues.apache.org/jira/browse/SOLR-4381
>             Project: Solr
>          Issue Type: Improvement
>          Components: query parsers
>            Reporter: Nolan Lawson
>            Priority: Minor
>              Labels: multi-word, queryparser, synonyms
>             Fix For: 4.2, 5.0
>
>         Attachments: SOLR-4381.patch
>
>
> This is an issue that seems to come up perennially.
> The [Solr docs|http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory]
> caution that index-time synonym expansion should be preferred to query-time synonym expansion,
> due to the way multi-word synonyms are treated and how IDF values can be boosted artificially.
> But query-time expansion should have huge benefits, given that changes to the synonyms don't
> require re-indexing, the index size stays the same, and the IDF values for the documents don't
> get permanently altered.
> The proposed solution is to move the synonym expansion logic out of the analysis chain
> (either query- or index-time) and into a new QueryParser.  See the attached patch for an implementation.
> The core Lucene functionality is untouched.  Instead, the EDismaxQParser is extended,
> and synonym expansion is done on-the-fly.  Queries are parsed into a lattice (i.e. all possible
> synonym combinations), while individual components of the query are still handled by the EDismaxQParser
> itself.
> It's not an ideal solution by any stretch. But it's nice and self-contained, so it invites
> experimentation and improvement.  And I think it fits in well with the merry band of misfit
> query parsers, like {{func}} and {{frange}}.
> More details about this solution can be found in [this blog post|http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/]
> and [the GitHub page for the code|https://github.com/healthonnet/hon-lucene-synonyms].
> At the risk of tooting my own horn, I also think this patch sufficiently fixes SOLR-3390
> (highlighting problems with multi-word synonyms) and LUCENE-4499 (better support for multi-word
> synonyms).
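
The "lattice" idea in the description above (all possible synonym combinations of a query) can be sketched roughly as follows. This is not the patch's implementation: `expand_query`, the `max_phrase_len` limit, and the sample table are assumptions for illustration, and the real parser hands each resulting component to the EDismaxQParser rather than returning plain strings.

```python
def expand_query(query, synonyms, max_phrase_len=3):
    """Return every rewrite of `query` obtained by swapping phrases for synonyms.

    `synonyms` maps a phrase (possibly multi-word) to a set of replacements.
    Every segmentation of the query into phrases of up to `max_phrase_len`
    tokens is tried, so multi-word entries are matched as whole phrases.
    """
    tokens = query.lower().split()

    def paths(start):
        if start == len(tokens):
            return {""}
        found = set()
        longest = min(max_phrase_len, len(tokens) - start)
        for length in range(1, longest + 1):
            phrase = " ".join(tokens[start:start + length])
            alternatives = set(synonyms.get(phrase, ()))
            if length == 1:
                alternatives.add(phrase)  # a single token may always stay as-is
            for tail in paths(start + length):
                for alt in alternatives:
                    found.add((alt + " " + tail).strip())
        return found

    return sorted(paths(0))


# Sample table (made up for this example).
group = {"dog", "hound", "canis familiaris"}
table = {term: set(group) for term in group}

expansions = expand_query("canis familiaris food", table)
```

Each string in `expansions` corresponds to one path through the lattice; a query parser built this way could, for example, combine the paths as alternative clauses of one query.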

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

