lucene-dev mailing list archives

From "Jack Krupansky" <j...@basetechnology.com>
Subject Proposal: Full support for multi-word synonyms at query time
Date Fri, 10 Aug 2012 17:36:08 GMT
One of the ongoing potholes of Solr and Lucene is lack of full support for multi-word synonyms
at query time. The root of the problem is twofold: individual terms are presented for analysis
which precludes recognition of multi-term synonyms, and the output stream from the analysis
process is a single, linear stream without regard to any graph/lattice structure for multiple
synonyms.

I intend to file a Jira, but wanted to get some wide attention and feedback on whether people
are ready to finally tackle this ongoing thorn in the side of an otherwise fantastic enterprise
search tool.

My proposed solution is fourfold:

1. Add an attribute, call it “path” for now, to the analysis process so that tokens coming
out of the analysis in a linear stream can be easily reconstituted into the graph/lattice
for multiple synonyms (single or multi-term) at the same position in a token sequence. There
could be multiple paths at a position and paths can be nested, possibly using a dot notation
such as “1.3.2”. There may be better ways to do this – this is just an initial proposal
to get the ball rolling.
2. Add a utility class and method for analysis for query parsers to present a sequence of
adjacent terms, rather than a single term at a time, so that multiword synonyms can be recognized.
Query parsers would be expected to present a “term sequence” – sequence of adjacent
terms without intervening operators – at one time.
3. Add a Query generation class and method that can take the graph/lattice for a token sequence
containing nested synonym alternatives and generate the appropriate Query structure with BooleanQuery
SHOULD or SpanOrQuery to implement synonym alternatives at a given position.
4. Modify the most popular query parsers to use the new analysis/generation.
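
To make points 1 and 3 a bit more concrete, here is a rough sketch in plain Java with no Lucene dependency. Everything in it is hypothetical: PathToken, the integer "slot", buildLattice and toQueryString are illustration-only names, not existing or proposed Lucene API. The path is treated as an opaque label shared by all tokens of one alternative, so nested paths such as “1.3.2” are not handled, and the output is a query-like string rather than a real BooleanQuery or SpanOrQuery.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch only: none of these names exist in Lucene or Solr.
class PathLatticeSketch {

    /** A token as it might leave analysis: term text plus slot and path labels. */
    static final class PathToken {
        final String term;
        final int slot;     // index of the slot in the term sequence where this alternative starts
        final String path;  // assumed label shared by all tokens of one alternative, e.g. "1" or "2"

        PathToken(String term, int slot, String path) {
            this.term = term;
            this.slot = slot;
            this.path = path;
        }
    }

    /** Rebuilds slot -> path -> terms from the flat, linear token stream. */
    static Map<Integer, Map<String, List<String>>> buildLattice(List<PathToken> stream) {
        Map<Integer, Map<String, List<String>>> lattice = new TreeMap<>();
        for (PathToken t : stream) {
            lattice.computeIfAbsent(t.slot, k -> new TreeMap<>())
                   .computeIfAbsent(t.path, k -> new ArrayList<>())
                   .add(t.term);
        }
        return lattice;
    }

    /** Renders the lattice as a query-like string: alternatives are OR'ed, multi-term ones become phrases. */
    static String toQueryString(Map<Integer, Map<String, List<String>>> lattice) {
        List<String> clauses = new ArrayList<>();
        for (Map<String, List<String>> alternatives : lattice.values()) {
            List<String> rendered = new ArrayList<>();
            for (List<String> terms : alternatives.values()) {
                String joined = String.join(" ", terms);
                rendered.add(terms.size() > 1 ? "\"" + joined + "\"" : joined);
            }
            clauses.add(rendered.size() > 1
                    ? "(" + String.join(" OR ", rendered) + ")"
                    : rendered.get(0));
        }
        return String.join(" ", clauses);
    }

    /** Linear stream for the term sequence "heart attack treatment" with one multi-word synonym. */
    static List<PathToken> sampleStream() {
        return Arrays.asList(
                new PathToken("heart", 0, "1"),
                new PathToken("attack", 0, "1"),
                new PathToken("myocardial", 0, "2"),
                new PathToken("infarction", 0, "2"),
                new PathToken("treatment", 1, "1"));
    }

    public static void main(String[] args) {
        // prints: ("heart attack" OR "myocardial infarction") treatment
        System.out.println(toQueryString(buildLattice(sampleStream())));
    }
}
```

The point of the sketch is only that a per-token path label is enough to recover the alternatives from a linear stream. How paths nest and how the result maps onto real Query classes are among the details still to be resolved.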

Obviously there are lots of fine details to resolve.

What I wanted to do right now is see if there is general support for pushing forward with
such a radical change, say for Lucene and Solr 5.0, or I suppose some 4.x > 4.0.

If I get enough support, I’ll file the Jira. Otherwise, I’ll just wait a year and then
try again.

I’m not personally committing to do the actual work, but simply to get the ball rolling
and keep it rolling. I’ll do work to the extent that nobody else is jumping in first. And
I certainly don’t want to propose some giant patch that never gets approved and has to be
constantly updated as the rest of Lucene/Solr changes. I would hope that pieces of this large
task could be carved off and committed incrementally to avoid having a monster patch at the
end.

So, the questions (primarily for committers) for now are:

1. Do people want to see this go forward now (reasonably near future as opposed to more than
a year away)?
2. Does the overall approach seem feasible and low enough risk?
3. Will this approach provide people with search results they expect?
4. Is this a high enough value feature change to justify the effort?

As far as support for multi-word synonyms at index time... uhhhhh... that’s another story.
I think the two (query vs. index) can be separated. The basic problem at index time is that
if you index “heart attack” and “myocardial infarction” at the same positions, queries
of “heart infarction” and “myocardial attack” will have false matches. And if the
synonyms in the list have varying lengths, the position of the next term will be off for phrase
queries. In any case, I am proposing moving forward with a full solution at query time only,
for now.
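
The false-match problem at index time can be shown with a toy positional index. This is a minimal sketch in plain Java, not Lucene code: StackedSynonymSketch, add and matchesPhrase are made-up names, and the phrase check is a simplified stand-in for what a real phrase query does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: term -> set of positions. Not Lucene API.
class StackedSynonymSketch {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Records that a term occurs at the given position. */
    void add(String term, int position) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(position);
    }

    /** True if the terms occur at consecutive positions, in order. */
    boolean matchesPhrase(String... terms) {
        if (terms.length == 0) {
            return false;
        }
        Set<Integer> starts = postings.get(terms[0]);
        if (starts == null) {
            return false;
        }
        for (int start : starts) {
            boolean ok = true;
            for (int i = 1; i < terms.length && ok; i++) {
                Set<Integer> positions = postings.get(terms[i]);
                ok = positions != null && positions.contains(start + i);
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }

    /** Indexes "heart attack" and "myocardial infarction" stacked at the same positions. */
    static StackedSynonymSketch sampleIndex() {
        StackedSynonymSketch index = new StackedSynonymSketch();
        index.add("heart", 0);
        index.add("myocardial", 0);
        index.add("attack", 1);
        index.add("infarction", 1);
        return index;
    }

    public static void main(String[] args) {
        StackedSynonymSketch index = sampleIndex();
        System.out.println(index.matchesPhrase("heart", "attack"));     // true (intended)
        System.out.println(index.matchesPhrase("heart", "infarction")); // true (false match)
        System.out.println(index.matchesPhrase("myocardial", "attack")); // true (false match)
    }
}
```

Because positions alone do not record which tokens belong to the same alternative, the index cannot tell the mixed phrases apart from the real ones.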

-- Jack Krupansky