From solr-user-return-149998-archive-asf-public=cust-asf.ponee.io@lucene.apache.org  Wed Oct  9 18:53:08 2019
Mailing-List: contact solr-user-help@lucene.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:solr-user-help@lucene.apache.org>
List-Unsubscribe: <mailto:solr-user-unsubscribe@lucene.apache.org>
List-Post: <mailto:solr-user@lucene.apache.org>
List-Id: <solr-user.lucene.apache.org>
Reply-To: solr-user@lucene.apache.org
Delivered-To: mailing list solr-user@lucene.apache.org
MIME-Version: 1.0
References: <A289A6C7-F4EB-429D-9A4A-EA1A9DD85E94@ibm.com> <CAEFAe-E872OYf3dUGf85Y=j_Zbza-r8naKSetRuqx5k0iFmMQg@mail.gmail.com>
 <CANGA2_NgCm-MHibzezMkAOEcD1buAUqoXTiXcyBwFd_A=dEbww@mail.gmail.com>
 <FA865E96-CF4D-46E8-A735-C8B00B8BC143@ibm.com> <7CE3B0F1-89A0-473A-9E63-DD21EAAE7A17@gmail.com>
 <CANGA2_OyPjLb3oV3RF501MeXk4WuH7hSuGcSuAoVRzniD2gs5w@mail.gmail.com> <69BE8670-7C35-4FD3-9D46-6E68376C6605@wunderwood.org>
In-Reply-To: <69BE8670-7C35-4FD3-9D46-6E68376C6605@wunderwood.org>
From: David Hastings <hastings.recursive@gmail.com>
Date: Wed, 9 Oct 2019 14:52:40 -0400
Message-ID: <CANGA2_PtgSXqKQxspVUS0nW8d=NLQcPxe7-gaiQSZMy2eL1Jpw@mail.gmail.com>
Subject: Re: Protecting Tokens from Any Analysis
To: solr-user@lucene.apache.org
Content-Type: multipart/alternative; boundary="000000000000b25ef105947eca03"

--000000000000b25ef105947eca03
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Yeah, I don't use it as a search; I only use it for, well, finding more
documents like that one :). For my purposes I tested 2- to 5-part shingles
and found that the 2-part shingles actually gave me better results, for my
use case, than anything longer.
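
A rough sketch of the stopwords-plus-shingles idea, in plain Python rather
than an actual Solr analysis chain. The stopword list, function names, and
tokenizer are made up for illustration. It does not model how Solr's
filters handle the position gaps left by removed stopwords, which depends
on the analyzer configuration.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"for", "on", "the", "a", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def shingles(tokens, min_size=2, max_size=2, sep="_"):
    """Return all word n-grams whose size is in [min_size, max_size]."""
    out = []
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(sep.join(tokens[i:i + size]))
    return out

def analyze(text, remove_stopwords=True, min_size=2, max_size=2):
    """Tokenize, optionally drop stopwords, then build shingles."""
    tokens = tokenize(text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return shingles(tokens, min_size, max_size)

print(analyze("europe for vacation"))  # ['europe_vacation']
print(analyze("europe on vacation"))   # ['europe_vacation']
print(analyze("europe for vacation", remove_stopwords=False))
# ['europe_for', 'for_vacation']
```

Raising max_size (up to 5 in my tests) adds the longer shingles alongside
the 2-part ones.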

I don't suppose you could point me to any of the phrase IDF documentation
for Solr, by chance?  That would be fun to poke around with.

On Wed, Oct 9, 2019 at 2:49 PM Walter Underwood <wunder@wunderwood.org>
wrote:

> We did something like that with Infoseek and Ultraseek. We had a set of
> “glue words” that made noun phrases and indexed patterns like “noun glue
> noun” as single tokens.
>
> I remember Doug Cutting saying that Nutch did something similar using
> pairs, but using that as a prefilter instead of as a relevance term.
>
> This is a way to get phrase IDF, which is pretty powerful stuff. Infoseek
> always beat Google in relevance tests, probably because of phrase IDF.
>
> More Like This could do the same thing, but it seems to be really slow and
> not especially useful as a search component.
>
> wunder
> Walter Underwood
> wunder@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Oct 9, 2019, at 8:14 AM, David Hastings <hastings.recursive@gmail.com>
> > wrote:
> >
> > However, with all that said, stopwords CAN be useful in some
> > situations.  I combine stopwords with the shingle factory to create
> > "interesting phrases" (not really) that I use in "my more like this"
> > needs.  For example,
> > europe for vacation
> > europe on vacation
> > will create the shingle
> > europe_vacation
> > which I can then use to relate other documents that would be much
> > more similar in such regard, rather than just using the "interesting
> > words"
> > europe, vacation
> >
> > With the stop words left in, the shingles would be
> > europe_for
> > for_vacation
> > and
> > europe_on
> > on_vacation
> >
> > Just something to keep in mind: there are a lot of creative ways to use
> > stopwords depending on your needs.  I use the above for a VERY basic ML
> > teacher and it works way better than using stopwords.
> >
> > On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> The theory behind stopwords is that they are “safe” to remove when
> >> calculating relevance, so we can squeeze every last bit of usefulness
> >> out of very constrained hardware (think 64K of memory. Yes kilobytes).
> >> We’ve come a long way since then and the necessity of removing
> >> stopwords from the indexed tokens to conserve RAM and disk is much less
> >> relevant than it used to be in “the bad old days” when the idea of
> >> stopwords was invented.
> >>
> >> I’m not quite so confident as Alex that there is “no benefit”, but I’ll
> >> totally agree that you should remove stopwords only _after_ you have
> >> some evidence that removing them is A Good Thing in your situation.
> >>
> >> And removing stopwords leads to some interesting corner cases. Consider
> >> a search for “to be or not to be” if they’re all stopwords.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
> >>> Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
> >>>
> >>> Hey Alex,
> >>>
> >>> Thank you!
> >>>
> >>> Re: stopwords being a thing of the past due to the affordability of
> >>> hardware...can you expand? I'm not sure I understand.
> >>>
> >>> --
> >>> Audrey Lorberfeld
> >>> Data Scientist, w3 Search
> >>> IBM
> >>> Audrey.Lorberfeld@IBM.com
> >>>
> >>>
> >>> On 10/8/19, 1:01 PM, "David Hastings" <hastings.recursive@gmail.com>
> >>> wrote:
> >>>
> >>>   Another thing to add to the above,
> >>>>
> >>>> IT:ibm. In this case, we would want to maintain the colon and the
> >>>> capitalization (otherwise “it” would be taken out as a stopword).
> >>>>
> >>>   stopwords are a thing of the past at this point.  there is no benefit
> >>>   to using them now with hardware being so cheap.
> >>>
> >>>   On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch
> >>>   <arafalov@gmail.com> wrote:
> >>>
> >>>> If you don't want it to be touched by a tokenizer, how would the
> >>>> protection step know that the sequence of characters you want to
> >>>> protect is "IT:ibm" and not "this is an IT:ibm term I want to
> >>>> protect"?
> >>>>
> >>>> It sounds to me like you may want to:
> >>>> 1) copyField to a second field
> >>>> 2) Apply a much lighter (whitespace?) tokenizer to that second field
> >>>> 3) Run the results through something like KeepWordFilterFactory
> >>>> 4) Search both fields with a boost on the second, higher-signal field
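
The four steps above can be sketched roughly as follows. This is plain
Python standing in for the two Solr fields, not real Solr configuration.
The stopword list, keep-word list, boost value, and function names are all
assumptions made up for illustration, and the scoring is a simple count of
matching terms, not Solr's relevance formula.

```python
import re

STOPWORDS = {"it", "the", "a", "is", "an", "this"}  # hypothetical
KEEP_WORDS = {"IT:ibm"}                              # hypothetical protected terms

def main_field(text):
    """Stand-in for a normally analyzed field: split on punctuation,
    lowercase, and drop stopwords."""
    tokens = [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]

def protected_field(text):
    """Stand-in for the copyField target: split on whitespace only and
    keep just the terms that appear in the keep-word list."""
    return [t for t in text.split() if t in KEEP_WORDS]

def score(query, doc, boost=5.0):
    """Count matching terms in each field, boosting the protected field."""
    main_hits = len(set(main_field(query)) & set(main_field(doc)))
    protected_hits = len(set(protected_field(query)) & set(protected_field(doc)))
    return main_hits + boost * protected_hits

print(main_field("this is an IT:ibm term"))       # ['ibm', 'term']
print(protected_field("this is an IT:ibm term"))  # ['IT:ibm']
print(score("IT:ibm", "this is an IT:ibm term"))  # 6.0
print(score("IT:ibm", "ibm makes it easy"))       # 1.0
```

A document that contains the exact protected term outranks one that only
shares the fragment "ibm", which is the point of boosting the second field.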
> >>>>
> >>>> The other option is to run a CharFilter
> >>>> (PatternReplaceCharFilterFactory), which runs before the tokenizer, to
> >>>> map known complex acronyms to non-tokenizable substitutions, e.g.
> >>>> "IT:ibm" -> "term365". As long as it is done at both index and query
> >>>> time, they will still match. You may have to have a bunch of them or
> >>>> write some sort of lookup map.
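
A minimal sketch of that pre-tokenizer substitution, again in plain Python
rather than Solr configuration. The "IT:ibm" -> "term365" mapping comes
from the example above; the stopword list, tokenizer, and function names
are assumptions for illustration.

```python
import re

# Lookup map of protected terms to placeholders that survive tokenization.
PROTECTED = {"IT:ibm": "term365"}
STOPWORDS = {"it", "is", "an", "this"}  # hypothetical

def char_filter(text):
    """Replace each protected term before the tokenizer ever sees it."""
    for original, placeholder in PROTECTED.items():
        text = re.sub(re.escape(original), placeholder, text)
    return text

def tokenize(text):
    """Split on punctuation, lowercase, and drop stopwords."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text, protect=True):
    """Run the same chain at index and query time so both sides agree."""
    return tokenize(char_filter(text) if protect else text)

print(analyze("this is an IT:ibm term"))                 # ['term365', 'term']
print(analyze("IT:ibm"))                                 # ['term365']
print(analyze("this is an IT:ibm term", protect=False))  # ['ibm', 'term']
```

Without the substitution, "IT" is lowercased and dropped as a stopword, so
only "ibm" survives. With it, the query and the document both produce
"term365" and still match.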
> >>>>
> >>>> Regards,
> >>>>  Alex.
> >>>>
> >>>> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
> >>>> Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is likely a rudimentary question, but I can’t seem to find a
> >>>>> straight-forward answer on forums or the documentation…is there a
> >>>>> way to protect tokens from ANY analysis? I know things like the
> >>>>> KeywordMarkerFilterFactory protect tokens from stemming, but we have
> >>>>> some terms we don’t even want our tokenizer to touch. Mostly, these
> >>>>> are IBM-specific acronyms, such as IT:ibm. In this case, we would
> >>>>> want to maintain the colon and the capitalization (otherwise “it”
> >>>>> would be taken out as a stopword).
> >>>>>
> >>>>> Any advice is appreciated!
> >>>>>
> >>>>> Thank you,
> >>>>> Audrey
> >>>>>
> >>>>> --
> >>>>> Audrey Lorberfeld
> >>>>> Data Scientist, w3 Search
> >>>>> IBM
> >>>>> Audrey.Lorberfeld@IBM.com
> >>>>>
> >>>>
> >>>
> >>>
> >>
> >>
>
>

--000000000000b25ef105947eca03--