lucene-dev mailing list archives

From "Nolan Lawson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SOLR-4381) Query-time multi-word synonym expansion
Date Wed, 30 Jan 2013 11:15:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-4381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13566378#comment-13566378 ]

Nolan Lawson commented on SOLR-4381:
------------------------------------

Hi Jan.  Thanks for the speedy reply!  In answer to your questions:

{quote}
Question is whether each query parser would need its own implementation or if it could be
generalized?
{quote}

I agree that it would be nice to abstract the code out of just EDisMax.  I think this parser
could subclass DisMax just as easily as EDisMax, or it could be abstracted out into its own
class that takes either DisMax or EDisMax as a constructor argument and then delegates to
it.  But for the Lucene parser it might be a bit more complicated, because I specifically
check for some DisMax parameters (e.g. QF), plus there is some code copied from EDisMax itself
where it's private rather than protected (e.g. [these lines|https://github.com/healthonnet/hon-lucene-synonyms/blob/master/src/main/java/org/apache/solr/search/SynonymExpandingExtendedDismaxQParserPlugin.java#L481]).
 Cleverer folks than me in the Lucene project might know a better way to do this, though.
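
As a rough illustration of the "takes either DisMax or EDisMax as a constructor argument and then delegates to it" option, here is a minimal sketch. It is not code from the patch: the class name, the use of plain strings in place of real Solr query objects, and the choice to OR the variants together are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical wrapper; it does not use or mirror any real Solr class. */
final class SynonymDelegatingParserSketch {
    // Stand-in for DisMax or EDisMax: takes a query string, returns a "parsed" form.
    private final Function<String, String> delegate;
    // Stand-in for the synonym logic: returns the original query plus its variants.
    private final Function<String, List<String>> expander;

    SynonymDelegatingParserSketch(Function<String, String> delegate,
                                  Function<String, List<String>> expander) {
        this.delegate = delegate;
        this.expander = expander;
    }

    /** Parses every variant with the wrapped parser and ORs the results (an assumption). */
    String parse(String query) {
        List<String> clauses = new ArrayList<>();
        for (String variant : expander.apply(query)) {
            clauses.add("(" + delegate.apply(variant) + ")");
        }
        return String.join(" OR ", clauses);
    }
}
```

The wrapper only needs the delegate to accept a query string, which is why the same class could in principle sit in front of either parser. The DisMax-specific parameter checks mentioned above (e.g. QF) are what this sketch leaves out.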

{quote}
A suggestion to allow that in your approach could be for the QP to inspect the query analysis
chain for each field in qf, and if it finds a SynonymFilterFactory, it will use that dictionary
instead of the global one (and of course disable the analysis filter).
{quote}

I agree that the less configuration, the better.  However, I kind of like leaving the SynonymFilterFactory
out of the analysis chains, because it makes it clearer that the synonym expansion logic isn't
happening there at all. Plus, in most of the use cases we've seen, the only difference between
the query-time analyzer and the index-time analyzer was the SynonymFilterFactory itself, so
removing it gained us some code simplicity, by allowing us to define just one analyzer for
both.  Perhaps other folks have had different experiences, though.
                
> Query-time multi-word synonym expansion
> ---------------------------------------
>
>                 Key: SOLR-4381
>                 URL: https://issues.apache.org/jira/browse/SOLR-4381
>             Project: Solr
>          Issue Type: Improvement
>          Components: query parsers
>            Reporter: Nolan Lawson
>            Priority: Minor
>              Labels: multi-word, queryparser, synonyms
>             Fix For: 4.2, 5.0
>
>         Attachments: SOLR-4381.patch
>
>
> This is an issue that seems to come up perennially.
> The [Solr docs|http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory]
caution that index-time synonym expansion should be preferred to query-time synonym expansion,
due to the way multi-word synonyms are treated and how IDF values can be boosted artificially.
But query-time expansion should have huge benefits, given that changes to the synonyms don't
require re-indexing, the index size stays the same, and the IDF values for the documents don't
get permanently altered.
> The proposed solution is to move the synonym expansion logic out of the analysis chain
(either query- or index-time) and into a new QueryParser.  See the attached patch for an implementation.
> The core Lucene functionality is untouched.  Instead, the EDismaxQParser is extended,
and synonym expansion is done on-the-fly.  Queries are parsed into a lattice (i.e. all possible
synonym combinations), while individual components of the query are still handled by the EDismaxQParser
itself.
> It's not an ideal solution by any stretch. But it's nice and self-contained, so it invites
experimentation and improvement.  And I think it fits in well with the merry band of misfit
query parsers, like {{func}} and {{frange}}.
> More details about this solution can be found in [this blog post|http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/]
and [the Github page for the code|https://github.com/healthonnet/hon-lucene-synonyms].
> At the risk of tooting my own horn, I also think this patch sufficiently fixes SOLR-3390
(highlighting problems with multi-word synonyms) and LUCENE-4499 (better support for multi-word
synonyms).
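
To make the "lattice (i.e. all possible synonym combinations)" idea from the description above a bit more concrete, here is a small sketch of how such an expansion could be enumerated. It is an illustration only and is not taken from the attached patch: the class name, the in-memory synonym map, and the whitespace tokenization are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not taken from the SOLR-4381 patch. */
final class SynonymLatticeSketch {

    /**
     * Returns every variant of the query obtained by replacing zero or more
     * phrases with one of their synonyms. The unmodified query comes first.
     */
    static List<String> expand(String query, Map<String, List<String>> synonyms) {
        List<String> tokens = Arrays.asList(query.trim().split("\\s+"));
        List<String> results = new ArrayList<>();
        walk(tokens, 0, "", synonyms, results);
        return results;
    }

    private static void walk(List<String> tokens, int pos, String prefix,
                             Map<String, List<String>> synonyms, List<String> out) {
        if (pos == tokens.size()) {
            out.add(prefix.trim());
            return;
        }
        // Path 1: keep the current token unchanged.
        walk(tokens, pos + 1, prefix + " " + tokens.get(pos), synonyms, out);
        // Path 2: replace any (possibly multi-word) phrase starting here
        // with each of its synonyms, then continue after the phrase.
        for (int end = pos + 1; end <= tokens.size(); end++) {
            String phrase = String.join(" ", tokens.subList(pos, end));
            List<String> alternatives = synonyms.get(phrase);
            if (alternatives == null) {
                continue;
            }
            for (String alt : alternatives) {
                walk(tokens, end, prefix + " " + alt, synonyms, out);
            }
        }
    }
}
```

With a synonym map of "dog" to "hound" and "canis familiaris", the query "dog bite" yields three variants: "dog bite", "hound bite", and "canis familiaris bite". Each variant would then be handed to the underlying parser as an ordinary query, so multi-word synonyms never have to pass through the analysis chain.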
