lucene-solr-user mailing list archives

From David Hastings <hastings.recurs...@gmail.com>
Subject Re: Re: Re: Protecting Tokens from Any Analysis
Date Wed, 09 Oct 2019 18:42:21 GMT
Only in my More Like This tools, but they have a very specific purpose;
otherwise, no.

On Wed, Oct 9, 2019 at 2:31 PM Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com
<Audrey.Lorberfeld@ibm.com> wrote:

> Wow, thank you so much, everyone. This is all incredibly helpful insight.
>
> So, would it be fair to say that the majority of you all do NOT use stop
> words?
>
> --
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> Audrey.Lorberfeld@IBM.com
>
>
>     On 10/9/19, 11:14 AM, "David Hastings" <hastings.recursive@gmail.com>
>     wrote:
>
>     However, with all that said, stopwords CAN be useful in some
>     situations. I combine stopword removal with the shingle factory to
>     create "interesting phrases" (not really) that I use for my More Like
>     This needs. For example, with stopwords removed before shingling,
>     europe for vacation
>     europe on vacation
>     will both create the shingle
>     europe_vacation
>     which I can then use to relate other documents that are much more
>     similar in that regard, rather than just using the "interesting words"
>     europe, vacation
>
>     Without removing stopwords, the shingles would be
>     europe_for
>     for_vacation
>     and
>     europe_on
>     on_vacation
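The stopwords-plus-shingles setup described above could look roughly like this in a Solr schema. This is a sketch only: the field type name, stopword file, and parameter values are assumptions, not the actual configuration from the thread.

```xml
<!-- Hypothetical analysis chain: remove stopwords, then shingle the
     surviving tokens so "europe for vacation" and "europe on vacation"
     can both yield a europe/vacation shingle. -->
<fieldType name="text_shingles" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Shingle adjacent tokens. Note that StopFilterFactory leaves
         position gaps, which ShingleFilterFactory fills with its
         fillerToken ("_" by default), so fillerToken may need tuning
         to get exactly "europe_vacation". -->
    <filter class="solr.ShingleFilterFactory" minShingleSize="2"
            maxShingleSize="2" outputUnigrams="false"
            tokenSeparator="_" fillerToken=""/>
  </analyzer>
</fieldType>
```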
>
>     Just something to keep in mind; there are a lot of creative ways to use
>     stopwords depending on your needs. I use the above for a VERY basic ML
>     teacher, and it works much better than plain stopword removal.
>
>     On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson
>     <erickerickson@gmail.com> wrote:
>
>     > The theory behind stopwords is that they are “safe” to remove when
>     > calculating relevance, so we can squeeze every last bit of usefulness
>     > out of very constrained hardware (think 64K of memory. Yes, kilobytes).
>     > We’ve come a long way since then, and the necessity of removing
>     > stopwords from the indexed tokens to conserve RAM and disk is much
>     > less relevant than it used to be in “the bad old days” when the idea
>     > of stopwords was invented.
>     >
>     > I’m not quite so confident as Alex that there is “no benefit”, but
>     > I’ll totally agree that you should remove stopwords only _after_ you
>     > have some evidence that removing them is A Good Thing in your
>     > situation.
>     >
>     > And removing stopwords leads to some interesting corner cases.
>     > Consider a search for “to be or not to be” if they’re all stopwords.
>     >
>     > Best,
>     > Erick
>     >
>     > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>     > Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
>     > >
>     > > Hey Alex,
>     > >
>     > > Thank you!
>     > >
>     > > Re: stopwords being a thing of the past due to the affordability of
>     > > hardware... Can you expand? I'm not sure I understand.
>     > >
>     > > --
>     > > Audrey Lorberfeld
>     > > Data Scientist, w3 Search
>     > > IBM
>     > > Audrey.Lorberfeld@IBM.com
>     > >
>     > >
>     > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recursive@gmail.com>
>     > > wrote:
>     > >
>     > >    Another thing to add to the above,
>     > >>
>     > >> IT:ibm. In this case, we would want to maintain the colon and the
>     > >> capitalization (otherwise “it” would be taken out as a stopword).
>     > >>
>     > >    Stopwords are a thing of the past at this point. There is no
>     > >    benefit to using them now with hardware being so cheap.
>     > >
>     > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch
>     > >    <arafalov@gmail.com> wrote:
>     > >
>     > >> If you don't want it to be touched by a tokenizer, how would the
>     > >> protection step know that the sequence of characters you want to
>     > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>     > >> protect"?
>     > >>
>     > >> It sounds to me like you may want to:
>     > >> 1) copyField to a second field
>     > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>     > >> 3) Run the results through something like KeepWordFilterFactory
>     > >> 4) Search both fields with a boost on the second, higher-signal field
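The four steps above might be sketched in a Solr schema roughly as follows. The field names, type name, and terms file are illustrative assumptions, not a tested configuration.

```xml
<!-- Step 1: copy the main field into a second, lightly analyzed field -->
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="text_protected" type="text_protected" indexed="true" stored="false"/>
<copyField source="text" dest="text_protected"/>

<!-- Steps 2-3: a whitespace tokenizer keeps "IT:ibm" intact as one token;
     KeepWordFilterFactory then keeps only the protected terms listed
     in the file, discarding everything else -->
<fieldType name="text_protected" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.KeepWordFilterFactory" words="protected-terms.txt"/>
  </analyzer>
</fieldType>
```

Step 4 would then be handled at query time, e.g. with edismax: `qf=text text_protected^5` (the boost value here is arbitrary).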
>     > >>
>     > >> The other option is to run a CharFilter
>     > >> (PatternReplaceCharFilterFactory), which runs before the tokenizer,
>     > >> to map known complex acronyms to non-tokenizable substitutions,
>     > >> e.g. "IT:ibm -> term365". As long as it is done at both indexing and
>     > >> query time, they will still match. You may have to have a bunch of
>     > >> them or write some sort of lookup map.
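That substitution could be wired in roughly like this; only the "IT:ibm -> term365" mapping from the example above is shown, and the type name is a made-up placeholder. Since the `<analyzer>` has no `type` attribute, the same chain applies at both index and query time, which is what keeps the substituted terms matching.

```xml
<fieldType name="text_acronyms" class="solr.TextField">
  <analyzer>
    <!-- Pre-tokenizer rewrite: "IT:ibm" becomes the tokenizer-safe
         string "term365" before tokenization can split it apart -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="IT:ibm" replacement="term365"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```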
>     > >>
>     > >> Regards,
>     > >>   Alex.
>     > >>
>     > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>     > >> Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
>     > >>>
>     > >>> Hi All,
>     > >>>
>     > >>> This is likely a rudimentary question, but I can’t seem to find a
>     > >>> straightforward answer on forums or the documentation… Is there a
>     > >>> way to protect tokens from ANY analysis? I know things like the
>     > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we
>     > >>> have some terms we don’t even want our tokenizer to touch. Mostly,
>     > >>> these are IBM-specific acronyms, such as IT:ibm. In this case, we
>     > >>> would want to maintain the colon and the capitalization (otherwise
>     > >>> “it” would be taken out as a stopword).
>     > >>>
>     > >>> Any advice is appreciated!
>     > >>>
>     > >>> Thank you,
>     > >>> Audrey
>     > >>>
>     > >>> --
>     > >>> Audrey Lorberfeld
>     > >>> Data Scientist, w3 Search
>     > >>> IBM
>     > >>> Audrey.Lorberfeld@IBM.com
>     > >>>
>     > >>
>     > >
>     > >
>     >
>     >
>
>
>
