Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: solr-user@lucene.apache.org
MIME-Version: 1.0
From: Doug Turnbull <dturnbull@opensourceconnections.com>
Date: Wed, 29 Mar 2017 14:45:50 +0000
Message-ID: <CALG6HL8W_cPeXCYnVKs2eSpDsTtcZ8_RbcYqWr+ZPoXwU5APPQ@mail.gmail.com>
Subject: The downsides of not splitting on whitespace in edismax (the old
 albino elephant prob)
To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
Content-Type: multipart/alternative; boundary=001a1144d1ba99f791054bdfa26e
archived-at: Wed, 29 Mar 2017 14:46:07 -0000

--001a1144d1ba99f791054bdfa26e
Content-Type: text/plain; charset=UTF-8

So, regarding this JIRA (
https://issues.apache.org/jira/browse/SOLR-9185), which makes splitting on
whitespace optional in Solr:

I want to point out that there's not a simple fix to multi-term synonyms in
part because of specific tradeoffs. Splitting on whitespace is *sometimes a
good thing*. Not splitting on whitespace (or enforcing some other
cross-field consistent token splitting behavior) actually recreates an old
problem that was the reason for creating dismax strategies in the first
place. So I'm glad we're leaving the sow option :)

If you're interested, this summarizes a bunch of historical research I did
into the Lucene code for my book on why splitting on whitespace is often a
good thing.

Currently the behavior of edismax is intentionally designed to be
term-centric. There's a bias towards having more of your query terms in a
relevant hit. This comes out of an old problem called "albino elephant"
that was the original reason dismax strategies came about. So if a user
searches for

albino elephant

The original Lucene query parser for search across fields would do
something like:

(title:albino OR title:elephant) OR (text:albino OR text:elephant)

With TF*IDF held constant for each term, a document that matches "albino"
in two fields scores the same as a document that matches BOTH albino and
elephant. Both get 2 "hits" in the OR query above. Most users consider this
not good! I want albino elephants, not just albino things or just elephant
things!
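
To make the tie concrete, here's a toy sketch in Python. It is not Lucene's
scoring: it assumes every field/term match is worth a constant 1.0 (the
"TF*IDF held constant" case above), and the two documents are made up for
illustration.

```python
# Toy model, not Lucene: each (field, term) match scores a constant 1.0.
# Both documents below are hypothetical examples.
docs = {
    "albino_only": {"title": "albino rabbit", "text": "an albino animal"},
    "both_terms": {"title": "albino elephant", "text": "a rare animal"},
}


def or_score(doc, terms, fields):
    """Sum one point per (field, term) match, like one flat OR of clauses."""
    return sum(
        1.0
        for field in fields
        for term in terms
        if term in doc[field].split()
    )


terms = ["albino", "elephant"]
fields = ["title", "text"]
scores = {name: or_score(doc, terms, fields) for name, doc in docs.items()}
print(scores)
```

Both documents come out with the same score, even though only one of them
mentions an elephant at all.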

So DisjunctionMaxQuery came about because somebody realized that if they
took the per-term maximum across fields, they could bias towards results
that had more of the user's search terms.

(title:albino | text:albino) OR (title:elephant | text:elephant)

Here the highest scored result has BOTH search terms. So a result that has
both elephant and albino will come to the top. That's what users typically
expect.
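
Here's the same kind of toy sketch for the per-term maximum. Again, this is
not Lucene's implementation: it assumes a constant 1.0 per match and no tie
breaker, and the documents are hypothetical.

```python
# Toy model, not Lucene: constant 1.0 per match, no tie breaker.
# Both documents below are hypothetical examples.
docs = {
    "albino_only": {"title": "albino rabbit", "text": "an albino animal"},
    "both_terms": {"title": "albino elephant", "text": "a rare animal"},
}


def match_score(doc, field, term):
    """Constant score for a match, zero otherwise."""
    return 1.0 if term in doc[field].split() else 0.0


def dismax_score(doc, terms, fields):
    """For each term take the best field score, then sum across terms."""
    return sum(
        max(match_score(doc, field, term) for field in fields)
        for term in terms
    )


terms = ["albino", "elephant"]
fields = ["title", "text"]
scores = {name: dismax_score(doc, terms, fields) for name, doc in docs.items()}
print(scores)
```

Matching "albino" in a second field no longer earns extra credit, so the
document with both terms wins.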

I call this strategy "term centric" -- it biases results towards documents
with more of the user's search terms. I contrast this with "field centric"
search, which focuses more on the specific analysis/matching behavior of one
field (shingles/synonyms/auto phrasing/taxonomies/whatever).

This strategy by necessity requires you to have a consistent, global
definition of what's a "search term" independent of fields, either by a
common analyzer across fields or by just splitting on whitespace. A common
analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
cross_fields search).

In other words, splitting on whitespace has *benefits* and *drawbacks.* The
drawback is what we experience with Solr multi-term synonyms. If you have
one field that breaks up by shingles/some multi-term synonym behavior and
another field that tokenizes on whitespace, you can't easily pick the
document with the "most search terms" as there's no consistent definition
of search terms.
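
A tiny sketch of that mismatch, with two stand-in "analyzers" written as
plain Python functions. These are hypothetical and are not Solr analysis
chains, and the field name title_shingles is made up.

```python
# Stand-in analyzers (hypothetical, not Solr analysis chains).
def whitespace_terms(query):
    """Tokenize on whitespace only."""
    return query.split()


def bigram_shingles(query):
    """Emit two-word shingles, roughly what a shingled field might see."""
    words = query.split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


query = "albino elephant"
per_field_terms = {
    "title": whitespace_terms(query),
    "title_shingles": bigram_shingles(query),  # hypothetical field name
}
print(per_field_terms)
```

One field sees two terms and the other sees one, so there's no shared list
of terms to take a per-term maximum over.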

I don't know where I'm going with this, but I want to point out that
there's no silver bullet for fixing multi-term synonyms. People should
still expect to be frustrated :). We should all be aware that a simple fix
to multi-term synonyms likely recreates another problem. I think there's
value in some strategy that does something like:

- Base relevance with edismax, splitting on whitespace to bias towards more
search terms
- Boosts with edismax w/o splitting on whitespace (or some other QP) to
layer in the effects you want for multiterm synonyms
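
One way the two bullets might translate into request parameters, sketched
in Python. Treat all of it as an assumption: the field names come from the
example above, build_solr_params is a made-up helper, no weights are tuned,
and I haven't verified against a live Solr that sow is honored when passed
as a local param inside bq.

```python
from urllib.parse import urlencode


def build_solr_params(user_query, fields=("title", "text")):
    """Hypothetical helper: base edismax query plus a boost query."""
    qf = " ".join(fields)
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": qf,
        # Base relevance: split on whitespace to stay term-centric.
        "sow": "true",
        # Additive boost: same user query without splitting on whitespace,
        # so multi-term synonym analysis can see the whole string.
        # Assumption: the nested parser picks up sow from its local params.
        "bq": "{!edismax sow=false qf='%s' v=$q}" % qf,
    }


params = build_solr_params("albino elephant")
query_string = urlencode(params)
print(query_string)
```

In practice the boost side would probably point at different fields (the
ones with the synonym analysis) and carry its own weight.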

How you balance these ranking signals is tricky and domain specific, but I
have found this sort of strategy balances both concerns.

Ok this probably should have just been a blog post, but I wanted to just
use my history degree for something useful for a change...
Best!
-Doug

--001a1144d1ba99f791054bdfa26e--