lucene-solr-user mailing list archives

From Doug Turnbull <dturnb...@opensourceconnections.com>
Subject Re: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)
Date Wed, 29 Mar 2017 14:49:26 GMT
What prompted me to send this was seeing this:

> When per-field query structures differ, e.g. when one field's analyzer
> removes stopwords and another's doesn't, edismax's DisjunctionMaxQuery
> structure when sow=false differs from that produced when sow=true. Briefly,
> sow=true produces a boolean query containing one dismax query per query
> term, while sow=false produces a dismax query containing one boolean query
> per field. Min-should-match processing does (what I think is) the right
> thing here. See
> TestExtendedDismaxParser.testSplitOnWhitespace_Different_Field_Analysis() for
> some examples of this. *Note*: when sow=false and all queried fields' query
> structure is the same, edismax does what it has always done: produce a
> boolean query containing one dismax query per term.

So just be careful, because this switches edismax towards a per-field dismax
(correct me if I'm wrong here) as opposed to per-term. If I understand this
correctly, you may run into a different set of problems along the albino
elephant spectrum when sow=false.
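
To make the difference concrete, here is a toy sketch in Python. It is not
Solr or Lucene code: every (field, term) match is simply worth 1.0, there is
no TF*IDF, no tie-breaker and no min-should-match, and the function names and
sample documents are made up for illustration. It only mimics the three query
shapes discussed in this thread, using the "albino elephant" example from the
original message quoted below.

```python
# Toy model only: a match is worth 1.0, regardless of field or term.
def match(doc, field, term):
    return 1.0 if term in doc.get(field, "").lower().split() else 0.0

def plain_boolean(doc, fields, terms):
    """(title:albino OR title:elephant) OR (text:albino OR text:elephant)"""
    return sum(match(doc, f, t) for f in fields for t in terms)

def per_term_dismax(doc, fields, terms):
    """sow=true shape: boolean query of one dismax (max over fields) per term."""
    return sum(max(match(doc, f, t) for f in fields) for t in terms)

def per_field_dismax(doc, fields, terms):
    """sow=false shape with differing analysis: dismax over one boolean per field."""
    return max(sum(match(doc, f, t) for t in terms) for f in fields)

fields = ["title", "text"]
terms = ["albino", "elephant"]

both_in_title = {"title": "albino elephant", "text": "a story"}
albino_twice = {"title": "albino rabbit", "text": "the albino rabbit"}
split_fields = {"title": "albino", "text": "elephant"}

for name, doc in [("both_in_title", both_in_title),
                  ("albino_twice", albino_twice),
                  ("split_fields", split_fields)]:
    print(name,
          plain_boolean(doc, fields, terms),
          per_term_dismax(doc, fields, terms),
          per_field_dismax(doc, fields, terms))
```

In this toy model the plain boolean query ties both_in_title with
albino_twice, the per-term shape ranks every document containing both terms
above albino_twice, and the per-field shape scores split_fields (both terms,
but in different fields) no higher than albino_twice.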

On Wed, Mar 29, 2017 at 10:45 AM Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> So with regard to this JIRA (
> https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr's
> splitting on whitespace optional:
>
> I want to point out that there's not a simple fix to multi-term synonyms
> in part because of specific tradeoffs. Splitting on whitespace is *sometimes
> a good thing*. Not splitting on whitespace (or enforcing some other
> cross-field consistent token splitting behavior) actually recreates an old
> problem that was the reason for creating dismax strategies in the first
> place. So I'm glad we're leaving the sow option :)
>
> If you're interested, this summarizes a bunch of historical research I did
> into Lucene code for my book for why splitting on whitespace is often a
> good thing
>
> Currently the behavior of edismax is intentionally designed to be
> term-centric. There's a bias towards having more of your query terms in a
> relevant hit. This comes out of an old problem called "albino elephant"
> that was the original reason dismax strategies came about. So if a user
> searches for
>
> albino elephant
>
> The original Lucene query parser for search across fields would do
> something like:
>
> (title:albino OR title:elephant) OR (text:albino OR text:elephant)
>
> With TF*IDF held constant for each term, a document that matches "albino"
> in two fields gets the same score as a document that matches BOTH albino
> and elephant. Both get 2 "hits" in the OR query above. Most users consider
> this not good! I want albino elephants, not just albino things nor just
> elephant things!
>
> So DisjunctionMaxQuery came about because somebody realized that if they
> took the per-term maximum across fields, they could bias towards results
> that had more of the user's search terms.
>
> (title:albino | text:albino) OR (title:elephant | text:elephant)
>
> Here the highest scored result has BOTH search terms, so a result that has
> both elephant and albino will come to the top. That's what users typically
> expect.
>
> I call this strategy "term centric" -- it biases results towards documents
> with more of the user's search terms. I contrast this with "field centric"
> search which focuses more on the specific analysis/matching behavior of one
> field (shingles/synonyms/auto phrasing/taxonomies/whatever)
>
> This strategy by necessity requires you to have a consistent, global
> definition of what's a "search term" independent of fields either by a
> common analyzer across fields or by just splitting on whitespace. A common
> analyzer is what BlendedTermQuery in Lucene enforces (used by ES's
> cross_fields search).
>
> In other words splitting on whitespace has *benefits* and *drawbacks.* The
> drawback is what we experience with Solr multiterm synonyms. If you have
> one field that breaks up by shingles/some multi-term synonym behavior and
> another field that tokenizes on whitespace, you can't easily pick the
> document with the "most search terms" as there's no consistent definition
> of search terms.
>
> I don't know where I'm going with this, but I want to point out that
> there's no silver bullet for fixing multi-term synonyms. People should still
> expect to be frustrated :). We should all be aware that a simple fix to
> multi-term synonyms likely recreates another problem. I think there's
> value in some strategy that does something like
>
> - Base relevance with edismax, splitting on whitespace to bias towards
> more search terms
> - Boosts with edismax w/o splitting on whitespace (or some other QP) to
> layer in the effects you want for multiterm synonyms
>
> How you balance these ranking signals is tricky and domain specific, but I
> have found this sort of strategy balances both concerns
>
> Ok this probably should have just been a blog post, but I wanted to just
> use my history degree for something useful for a change...
> Best!
> -Doug
>
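
The two-layer strategy suggested in the quoted message (base relevance with
whitespace splitting, plus a boost without it) might be expressed as Solr
request parameters roughly as follows. This is an untested sketch, not a
recommended configuration: the field names title and text are taken from the
example queries above, the helper name build_params and the qq parameter name
are made up, and it assumes a Solr version that has the sow parameter from
SOLR-9185 and accepts it as a local param on a nested edismax query.

```python
from urllib.parse import urlencode

def build_params(user_query, qf="title text"):
    """Build request params: term-centric base query plus a sow=false boost."""
    return {
        "defType": "edismax",
        "q": user_query,   # base relevance
        "qf": qf,
        "sow": "true",     # split on whitespace: bias towards more search terms
        "qq": user_query,  # hypothetical param holding the raw query for reuse
        # Additive boost query: same text, parsed without whitespace splitting,
        # so multi-term synonyms / shingles in field analysis can take effect.
        "bq": "{!edismax sow=false qf='%s' v=$qq}" % qf,
    }

params = build_params("albino elephant")
print(urlencode(params))
```

How strongly the boost should count relative to the base query is the tricky,
domain-specific part; nothing in this sketch addresses that balance.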
