From: Ashwin Ramesh
Date: Tue, 1 Oct 2019 09:24:09 +1000
Subject: Re: Dealing with multi-word keywords and SOW=true
To: solr-user@lucene.apache.org

Thanks Erick, that seems to work! Should I leave it in qf also? For example,
the query "blue dog" may be represented as separate tokens in the keyword
index.

On Mon, Sep 30, 2019 at 9:32 PM Erick Erickson wrote:

> Have you tried taking your keyword field out of the "qf" param and adding
> it explicitly? As keyword:"ice cream"
>
> Best,
> Erick
>
> > On Sep 30, 2019, at 5:27 AM, Ashwin Ramesh wrote:
> >
> > Hi everybody,
> >
> > I am using the edismax parser and have noticed a very specific behaviour
> > in how sow=true (default) handles multi-word keywords.
> >
> > We have a field called 'keywords', which uses the general
> > KeywordTokenizerFactory. There are also other text fields such as title
> > and description.
> >
> > When we index a document with a keyword "ice cream", for example, we know
> > it gets indexed into that field as "ice cream".
> >
> > However, at query time, I noticed that if we run an edismax query:
> > q=ice cream
> > qf=keywords
> >
> > I do not get that document back as a match. This is due to sow=true
> > splitting the user's query, so that the final tokens ("ice" and "cream")
> > are not present in the keywords field.
> >
> > I was wondering what the best practice around this is. Some thoughts I
> > have had:
> >
> > 1. Index multi-word keywords with hyphens or something similar. E.g.
> > "ice cream" -> "ice-cream"
> > 2. Additionally index the separate words as keywords. E.g. "ice cream"
> > -> "ice cream", "ice", "cream". However, this method will result in the
> > loss of intent (q=ice would return this document).
> > 3. Add a boost query which is an edismax query where we explicitly set
> > sow=false and add a huge boost. E.g.
> > bq={!edismax qf=keywords^1000 sow=false bq="" boost="" pf="" tie=1.00 v="ice cream"}
> >
> > Is there an industry-standard solution for this type of problem? Keep
> > in mind that the other text fields may also include these terms. E.g.
> > title="This is ice cream", which would match the query. This specific
> > problem affects the keywords field for the obvious reason that the
> > indexing pipeline does not tokenize keywords.
> >
> > Thank you for all your amazing help,
> >
> > Regards,
> >
> > Ash
> >
> > --
> > P.S. We've launched a new blog to share the latest ideas and case
> > studies from our team. Check it out here: product.canva.com
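
For readers of the archive, here is a minimal sketch of one way to read Erick's suggestion: leave the tokenized text fields in qf and append the whole user query to q as an explicit quoted clause on the keyword field. The field names title, description and keywords come from the thread; the helper names escape_phrase and build_params are hypothetical, and appending the clause to q (rather than putting it in bq or fq) is an assumption about what "adding it explicitly" means. Whether the extra clause is optional or required depends on your mm and q.op settings, which this sketch does not set.

```python
from urllib.parse import urlencode


def escape_phrase(text):
    """Escape backslashes and double quotes so text is safe inside a quoted phrase."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_params(user_query, text_fields, keyword_field="keywords"):
    """Build edismax request parameters (hypothetical helper, not a Solr API).

    The free-text part of q is still matched against the tokenized fields
    listed in qf. The quoted clause targets the untokenized keyword field
    by name, so the phrase reaches that field's analyzer as a single unit.
    """
    query = user_query.strip()
    phrase = escape_phrase(query)
    return {
        "defType": "edismax",
        "q": f'{query} {keyword_field}:"{phrase}"',
        # The keyword field is deliberately left out of qf.
        "qf": " ".join(text_fields),
    }


params = build_params("ice cream", ["title", "description"])
print(params["q"])
print(urlencode(params))
```

The first print shows the q value that would be sent, so you can paste it into the Solr admin UI with debugQuery=true and check how it is parsed against your own schema.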