From: Doug Turnbull
Date: Wed, 29 Mar 2017 14:45:50 +0000
Subject: The downsides of not splitting on whitespace in edismax (the old albino elephant prob)
To: "solr-user@lucene.apache.org"

So with regards to this JIRA (
https://issues.apache.org/jira/browse/SOLR-9185), which makes Solr's splitting on whitespace optional: I want to point out that there's no simple fix for multi-term synonyms, in part because of specific tradeoffs. Splitting on whitespace is *sometimes a good thing*. Not splitting on whitespace (and not enforcing some other cross-field consistent token splitting behavior) actually recreates an old problem that was the reason for creating dismax strategies in the first place. So I'm glad we're leaving the sow option :)

If you're interested, this summarizes a bunch of historical research I did into Lucene code for my book on why splitting on whitespace is often a good thing.

Currently the behavior of edismax is intentionally designed to be term-centric. There's a bias towards having more of your query terms in a relevant hit. This comes out of an old problem called "albino elephant" that was the original reason dismax strategies came about.

So if a user searches for

    albino elephant

the original Lucene query parser for search across fields would do something like:

    (title:albino OR title:elephant) OR (text:albino OR text:elephant)

With TF*IDF held constant for each term, a document that matches "albino" in two fields scores the same as a document that matches BOTH albino and elephant. Both get 2 "hits" in the OR query above. Most users consider this not good! I want albino elephants, not just albino things nor just elephant things!

So DisjunctionMaxQuery came about because somebody realized that if they took the per-term maximum across fields, they could bias towards results that had more of the user's search terms.

    (title:albino | text:albino) OR (title:elephant | text:elephant)

Here the highest scored result has BOTH search terms. So a result that has both elephant and albino will come to the top, which is what users typically expect.

I call this strategy "term centric" -- it biases results towards documents with more of the user's search terms.
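The difference between the two query shapes above can be sketched with a toy scorer. This is not Lucene code: the documents, field names, and function names are made up for illustration, every match is assumed to score exactly 1.0 (TF*IDF held constant), and the dismax tie-breaker is ignored.

```python
# Toy illustration only: every (field, term) match scores 1.0, i.e. TF*IDF
# is held constant. Documents, field names, and functions are hypothetical.

FIELDS = ["title", "text"]
QUERY_TERMS = ["albino", "elephant"]

DOCS = {
    "albino_only": {"title": "albino rabbits", "text": "care of albino pets"},
    "both_terms": {"title": "albino elephant", "text": "a day at the zoo"},
}


def match_score(doc, field, term):
    """Return 1.0 if the whitespace-split field contains the term, else 0.0."""
    return 1.0 if term in doc[field].split() else 0.0


def plain_or_score(doc):
    """Plain OR across fields: add up every (field, term) match."""
    return sum(match_score(doc, f, t) for f in FIELDS for t in QUERY_TERMS)


def dismax_score(doc):
    """Per term, keep only the best-scoring field, then add up over terms."""
    return sum(max(match_score(doc, f, t) for f in FIELDS) for t in QUERY_TERMS)


for name, doc in DOCS.items():
    print(name, "plain OR:", plain_or_score(doc), "dismax:", dismax_score(doc))
```

Under these assumptions both documents tie under the plain OR, while the per-term maximum ranks the document containing both terms above the one that only repeats "albino" in two fields.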
I contrast this with "field centric" search, which focuses more on the specific analysis/matching behavior of one field (shingles/synonyms/auto phrasing/taxonomies/whatever).

The term-centric strategy by necessity requires you to have a consistent, global definition of what's a "search term" independent of fields, either by a common analyzer across fields or by just splitting on whitespace. A common analyzer is what BlendedTermQuery in Lucene enforces (used by ES's cross_fields search).

In other words, splitting on whitespace has *benefits* and *drawbacks*. The drawback is what we experience with Solr multi-term synonyms. If you have one field that breaks up by shingles/some multi-term synonym behavior and another field that tokenizes on whitespace, you can't easily pick the document with the "most search terms", as there's no consistent definition of search terms.

I don't know where I'm going with this, but I want to point out that there is no silver bullet for fixing multi-term synonyms. People should still expect to be frustrated :). We should all be aware that a simple fix to multi-term synonyms likely recreates another problem.

I think there's value in some strategy that does something like:

- Base relevance with edismax, splitting on whitespace to bias towards more search terms
- Boosts with edismax w/o splitting on whitespace (or some other QP) to layer in the effects you want for multi-term synonyms

How you balance these ranking signals is tricky and domain specific, but I have found this sort of strategy balances both concerns.

Ok, this probably should have just been a blog post, but I wanted to use my history degree for something useful for a change...

Best!
-Doug