From: David Hastings
Date: Wed, 9 Oct 2019 14:52:40 -0400
Subject: Re: Protecting Tokens from Any Analysis
To: solr-user@lucene.apache.org

Yeah, I don't use it as a search, only, well, finding more documents like
that one :). For my purposes I tested between 2- and 5-part shingles and
found that the 2-part shingles actually gave me better results, for my use
case, than anything longer.

I don't suppose you could point me to any of the phrase IDF documentation
for Solr by chance? That would be fun to poke around with.

On Wed, Oct 9, 2019 at 2:49 PM Walter Underwood wrote:

> We did something like that with Infoseek and Ultraseek. We had a set of
> “glue words” that made noun phrases and indexed patterns like
> “noun glue noun” as single tokens.
>
> I remember Doug Cutting saying that Nutch did something similar using
> pairs, but using that as a prefilter instead of as a relevance term.
>
> This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek
> always beat Google in relevance tests, probably because of phrase IDF.
>
> More Like This could do the same thing, but it seems to be really slow and
> not especially useful as a search component.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Oct 9, 2019, at 8:14 AM, David Hastings wrote:
> >
> > However, with all that said, stopwords CAN be useful in some situations.
> > I combine stopwords with the shingle factory to create "interesting
> > phrases" (not really) that I use in "my more like this" needs.
> > for example,
> > europe for vacation
> > europe on vacation
> > will create the shingle
> > europe_vacation
> > which I can then use to relate other documents that would be much
> > more similar in such regard, rather than just using the "interesting
> > words"
> > europe, vacation
> >
> > with stop words, the shingles would be
> > europe_for
> > for_vacation
> > and
> > europe_on
> > on_vacation
> >
> > just something to keep in mind, there's a lot of creative ways to use
> > stopwords depending on your needs. I use the above for a VERY basic ML
> > teacher and it works way better than using stopwords.
> >
> > On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson wrote:
> >
> >> The theory behind stopwords is that they are “safe” to remove when
> >> calculating relevance, so we can squeeze every last bit of usefulness
> >> out of very constrained hardware (think 64K of memory. Yes kilobytes).
> >> We’ve come a long way since then and the necessity of removing
> >> stopwords from the indexed tokens to conserve RAM and disk is much
> >> less relevant than it used to be in “the bad old days” when the idea
> >> of stopwords was invented.
> >>
> >> I’m not quite so confident as Alex that there is “no benefit”, but
> >> I’ll totally agree that you should remove stopwords only _after_ you
> >> have some evidence that removing them is A Good Thing in your
> >> situation.
> >>
> >> And removing stopwords leads to some interesting corner cases.
> >> Consider a search for “to be or not to be” if they’re all stopwords.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> >>> Audrey.Lorberfeld@ibm.com wrote:
> >>>
> >>> Hey Alex,
> >>>
> >>> Thank you!
> >>>
> >>> Re: stopwords being a thing of the past due to the affordability of
> >>> hardware...can you expand? I'm not sure I understand.
> >>>
> >>> --
> >>> Audrey Lorberfeld
> >>> Data Scientist, w3 Search
> >>> IBM
> >>> Audrey.Lorberfeld@IBM.com
> >>>
> >>>
> >>> On 10/8/19, 1:01 PM, "David Hastings" wrote:
> >>>
> >>> Another thing to add to the above,
> >>>>
> >>>> IT:ibm. In this case, we would want to maintain the colon and the
> >>>> capitalization (otherwise “it” would be taken out as a stopword).
> >>>>
> >>> stopwords are a thing of the past at this point. there is no benefit
> >>> to using them now with hardware being so cheap.
> >>>
> >>> On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch
> >>> <arafalov@gmail.com> wrote:
> >>>
> >>>> If you don't want it to be touched by a tokenizer, how would the
> >>>> protection step know that the sequence of characters you want to
> >>>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >>>> protect"?
> >>>>
> >>>> What it sounds to me is that you may want to:
> >>>> 1) copyField to a second field
> >>>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >>>> 3) Run the results through something like KeepWordFilterFactory
> >>>> 4) Search both fields with a boost on the second, higher-signal field
> >>>>
> >>>> The other option is to run CharacterFilter,
> >>>> (PatternReplaceCharFilterFactory) which is pre-tokenizer to map known
> >>>> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
> >>>> term365". As long as it is done on both indexing and query, they will
> >>>> still match. You may have to have a bunch of them or write some sort
> >>>> of lookup map.
> >>>>
> >>>> Regards,
> >>>> Alex.
> >>>>
> >>>> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >>>> Audrey.Lorberfeld@ibm.com wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is likely a rudimentary question, but I can’t seem to find a
> >>>>> straight-forward answer on forums or the documentation…is there a
> >>>>> way to protect tokens from ANY analysis?
> >>>>> I know things like the KeywordMarkerFilterFactory protect tokens
> >>>>> from stemming, but we have some terms we don’t even want our
> >>>>> tokenizer to touch. Mostly, these are IBM-specific acronyms, such
> >>>>> as IT:ibm. In this case, we would want to maintain the colon and
> >>>>> the capitalization (otherwise “it” would be taken out as a
> >>>>> stopword).
> >>>>>
> >>>>> Any advice is appreciated!
> >>>>>
> >>>>> Thank you,
> >>>>> Audrey
> >>>>>
> >>>>> --
> >>>>> Audrey Lorberfeld
> >>>>> Data Scientist, w3 Search
> >>>>> IBM
> >>>>> Audrey.Lorberfeld@IBM.com
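
Editor's sketch of the shingle example quoted in this thread ("europe for vacation" and "europe on vacation" both giving europe_vacation). This is plain Python, not Solr code: it only imitates a stopword filter followed by 2-part shingling with "_" as the separator, and it ignores analysis details such as the position gaps that a real stop filter may leave behind. The STOPWORDS set and the shingles() function are made up for illustration.

```python
STOPWORDS = {"for", "on"}  # assumed stopword list, for illustration only


def shingles(text, stopwords=frozenset(), size=2, sep="_"):
    """Return word n-grams of `size` tokens, built after dropping stopwords."""
    tokens = [t for t in text.lower().split() if t not in stopwords]
    return [sep.join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


# Stopwords removed first: both phrases collapse to the same shingle.
print(shingles("europe for vacation", STOPWORDS))  # ['europe_vacation']
print(shingles("europe on vacation", STOPWORDS))   # ['europe_vacation']

# Stopwords kept: each phrase yields different shingles.
print(shingles("europe for vacation"))  # ['europe_for', 'for_vacation']
print(shingles("europe on vacation"))   # ['europe_on', 'on_vacation']
```

With the stopwords removed first, two documents that phrase the idea differently end up sharing one term, which is what makes the shingle usable for relating similar documents.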
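
Editor's sketch of the pre-tokenizer substitution idea from Alex's reply (mapping "IT:ibm" to "term365" before analysis, on both the index and query side). Again this is plain Python, not Solr configuration: analyze() is a crude stand-in for a tokenizer plus lowercasing plus a stop filter, and the PROTECTED map, the stopword set, and all function names are assumptions made for the example.

```python
import re

# Lookup map of protected terms; the single entry is the example from the thread.
PROTECTED = {"IT:ibm": "term365"}
STOPWORDS = {"it", "is", "an", "the"}  # assumed stopword list

# Longest keys first so a longer protected term wins over a shorter prefix.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)


def protect(text):
    """Replace each protected term (case-sensitive) with its placeholder."""
    return _PATTERN.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text, use_protect=True):
    """Stand-in analysis chain: optional substitution, lowercase, split, stop filter."""
    if use_protect:
        text = protect(text)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(analyze("IT:ibm", use_protect=False))  # ['ibm']  ("it" is lost as a stopword)
print(analyze("IT:ibm"))                     # ['term365']
print(analyze("this is an IT:ibm term"))     # ['this', 'term365', 'term']
```

Because the same substitution runs on both the indexed text and the query, the placeholder on each side matches, which is the condition Alex points out.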
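
Editor's sketch of steps 2 and 3 of Alex's first suggestion (a lighter, whitespace-only tokenizer on a second field, followed by a keep-words filter). It is plain Python rather than a Solr field type; KEEP_WORDS and keep_protected() are hypothetical names, and the exact, case-sensitive matching is a choice made for this example.

```python
KEEP_WORDS = {"IT:ibm"}  # hypothetical list of terms to protect


def keep_protected(text, keep_words=KEEP_WORDS):
    """Split on whitespace only, then keep tokens that exactly match a protected term."""
    return [tok for tok in text.split() if tok in keep_words]


print(keep_protected("this is an IT:ibm term I want to protect"))  # ['IT:ibm']
print(keep_protected("it is ibm"))                                 # []
```

Since only whitespace splits the text, the colon and capitalization survive. A token with trailing punctuation, such as "IT:ibm," at the end of a clause, would not match in this sketch and would need extra handling.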