lucene-solr-user mailing list archives

From David Hastings <hastings.recurs...@gmail.com>
Subject Re: Re: Re: Re: Protecting Tokens from Any Analysis
Date Wed, 09 Oct 2019 19:00:36 GMT
oh and by 'non stop' i mean close enough for me :)

On Wed, Oct 9, 2019 at 2:59 PM David Hastings <hastings.recursive@gmail.com>
wrote:

> if you have anything close to a decent server you won't notice it at all.  i'm
> at about 21 million documents, index varies between 450gb and 800gb
> depending on merges, and about 60k searches a day and stay sub second non
> stop, and this is on a single core/non cloud environment
>
> On Wed, Oct 9, 2019 at 2:55 PM Audrey Lorberfeld -
> Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
>
>> Also, in terms of computational cost, it would seem that including most
>> terms/not having a stop list would take a toll on the system. For instance,
>> right now we have "ibm" as a stop word because it appears everywhere in our
>> corpus. If we did not include it in the stop words file, we would have to
>> retrieve every single document in our corpus and rank them. That's a high
>> computational cost, no?
>>
>> --
>> Audrey Lorberfeld
>> Data Scientist, w3 Search
>> IBM
>> Audrey.Lorberfeld@IBM.com
>>
>>
>> On 10/9/19, 2:31 PM, "Audrey Lorberfeld - Audrey.Lorberfeld@ibm.com"
>> <Audrey.Lorberfeld@ibm.com> wrote:
>>
>>     Wow, thank you so much, everyone. This is all incredibly helpful insight.
>>
>>     So, would it be fair to say that the majority of you all do NOT use
>>     stop words?
>>
>>     --
>>     Audrey Lorberfeld
>>     Data Scientist, w3 Search
>>     IBM
>>     Audrey.Lorberfeld@IBM.com
>>
>>
>>     On 10/9/19, 11:14 AM, "David Hastings" <hastings.recursive@gmail.com> wrote:
>>
>>         However, with all that said, stopwords CAN be useful in some situations.  I
>>         combine stopwords with the shingle factory to create "interesting phrases"
>>         (not really) that i use in "my more like this" needs.  for example,
>>         europe for vacation
>>         europe on vacation
>>         will create the shingle
>>         europe_vacation
>>         which i can then use to relate other documents that would be much
>>         more similar in such regard, rather than just using the "interesting words"
>>         europe, vacation
>>
>>         with stop words, the shingles would be
>>         europe_for
>>         for_vacation
>>         and
>>         europe_on
>>         on_vacation
>>
>>         just something to keep in mind, there's a lot of creative ways to use
>>         stopwords depending on your needs.  i use the above for a VERY basic ML
>>         teacher and it works way better than using stopwords,
>>
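A minimal sketch of the shingle idea in the message above, written in plain Python rather than as a Solr analysis chain. The function name `shingles`, the two-word shingle size, and the tiny stop list are assumptions made for illustration; the underscore separator follows the examples in the message. The sketch simply closes the gap left by a dropped stopword, and how a real Solr chain treats that gap depends on how its filters are configured.

```python
# Hypothetical stop list, just large enough for the two example phrases.
STOPWORDS = {"for", "on"}


def shingles(text, stopwords=frozenset(), sep="_"):
    """Return two-word shingles, optionally dropping stopwords first."""
    # Lowercase, split on whitespace, and drop any token in the stop list.
    tokens = [t for t in text.lower().split() if t not in stopwords]
    # Pair each remaining token with its successor.
    return [sep.join(pair) for pair in zip(tokens, tokens[1:])]


phrases = ["europe for vacation", "europe on vacation"]

# Stopwords removed before shingling: both phrases yield the same shingle.
with_removal = [shingles(p, STOPWORDS) for p in phrases]

# Stopwords left in the token stream: the phrases share no shingle.
without_removal = [shingles(p) for p in phrases]

print(with_removal)
print(without_removal)
```

Once the stopwords are dropped, both phrases collapse to the same shingle, which is what lets that shingle relate the two documents.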
>>         On Wed, Oct 9, 2019 at 10:51 AM Erick Erickson <erickerickson@gmail.com>
>>         wrote:
>>
>>         > The theory behind stopwords is that they are “safe” to remove when
>>         > calculating relevance, so we can squeeze every last bit of usefulness out
>>         > of very constrained hardware (think 64K of memory. Yes kilobytes). We’ve
>>         > come a long way since then and the necessity of removing stopwords from the
>>         > indexed tokens to conserve RAM and disk is much less relevant than it used
>>         > to be in “the bad old days” when the idea of stopwords was invented.
>>         >
>>         > I’m not quite so confident as Alex that there is “no benefit”, but I’ll
>>         > totally agree that you should remove stopwords only _after_ you have some
>>         > evidence that removing them is A Good Thing in your situation.
>>         >
>>         > And removing stopwords leads to some interesting corner cases. Consider a
>>         > search for “to be or not to be” if they’re all stopwords.
>>         >
>>         > Best,
>>         > Erick
>>         >
>>         > > On Oct 9, 2019, at 9:38 AM, Audrey Lorberfeld -
>>         > > Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
>>         > >
>>         > > Hey Alex,
>>         > >
>>         > > Thank you!
>>         > >
>>         > > Re: stopwords being a thing of the past due to the affordability of
>>         > > hardware...can you expand? I'm not sure I understand.
>>         > >
>>         > > --
>>         > > Audrey Lorberfeld
>>         > > Data Scientist, w3 Search
>>         > > IBM
>>         > > Audrey.Lorberfeld@IBM.com
>>         > >
>>         > >
>>         > > On 10/8/19, 1:01 PM, "David Hastings" <hastings.recursive@gmail.com>
>>         > > wrote:
>>         > >
>>         > >    Another thing to add to the above,
>>         > >>
>>         > >> IT:ibm. In this case, we would want to maintain the colon and the
>>         > >> capitalization (otherwise “it” would be taken out as a stopword).
>>         > >>
>>         > >    stopwords are a thing of the past at this point.  there is no benefit to
>>         > >    using them now with hardware being so cheap.
>>         > >
>>         > >    On Tue, Oct 8, 2019 at 12:43 PM Alexandre Rafalovitch <arafalov@gmail.com>
>>         > >    wrote:
>>         > >
>>         > >> If you don't want it to be touched by a tokenizer, how would the
>>         > >> protection step know that the sequence of characters you want to
>>         > >> protect is "IT:ibm" and not "this is an IT:ibm term I want to
>>         > >> protect"?
>>         > >>
>>         > >> What it sounds like to me is that you may want to:
>>         > >> 1) copyField to a second field
>>         > >> 2) Apply a much lighter (whitespace?) tokenizer to that second field
>>         > >> 3) Run the results through something like KeepWordFilterFactory
>>         > >> 4) Search both fields with a boost on the second, higher-signal field
>>         > >>
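Steps 2 and 3 of the list above can be sketched in plain Python. This only mimics the behaviour and is not Solr code: `standard_like`, `protected_field`, and the `KEEP_WORDS` set are hypothetical names, and the regex split is only a rough stand-in for whatever heavier tokenizer the main field uses.

```python
import re

# Hypothetical keep list: the exact tokens that should survive untouched.
KEEP_WORDS = {"IT:ibm"}


def standard_like(text):
    """Rough stand-in for a heavier analyzer: split on punctuation, lowercase."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def protected_field(text, keep=KEEP_WORDS):
    """Whitespace tokenization followed by a keep-word filter."""
    # Splitting on whitespace alone leaves the colon and the case intact.
    # Only tokens that appear verbatim in the keep list are retained.
    return [t for t in text.split() if t in keep]


doc = "this is an IT:ibm term I want to protect"

main_tokens = standard_like(doc)         # "IT:ibm" is split into "it" and "ibm"
protected_tokens = protected_field(doc)  # only the protected acronym remains

print(main_tokens)
print(protected_tokens)
```

The second field ends up holding only the acronyms, so a match there is a strong signal, which is presumably why step 4 suggests boosting it.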
>>         > >> The other option is to run a CharFilter
>>         > >> (PatternReplaceCharFilterFactory), which is pre-tokenizer, to map known
>>         > >> complex acronyms to non-tokenizable substitutions. E.g. "IT:ibm ->
>>         > >> term365". As long as it is done on both indexing and query, they will
>>         > >> still match. You may have to have a bunch of them or write some sort
>>         > >> of lookup map.
>>         > >>
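The substitution approach can be sketched the same way, again in plain Python rather than Solr configuration. The `PROTECTED` lookup map, the stop list, and the function names are assumptions for illustration; only the "IT:ibm" to "term365" mapping comes from the message. The point shown is that the replacement runs on the raw text before tokenization, and that the same `analyze` function is applied to both documents and queries.

```python
import re

# Hypothetical lookup map from protected acronyms to opaque placeholders.
PROTECTED = {"IT:ibm": "term365"}

# Hypothetical stop list for the sketch.
STOPWORDS = {"it", "is", "an", "this", "to", "i"}

# One alternation over all protected strings, longest first so that a longer
# acronym is never shadowed by a shorter one that is its prefix.
_pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(PROTECTED, key=len, reverse=True))
)


def char_filter(text):
    """Replace protected acronyms before any tokenization happens."""
    return _pattern.sub(lambda m: PROTECTED[m.group(0)], text)


def analyze(text):
    """Char filter, then a simple tokenizer, lowercasing, and stopword removal."""
    text = char_filter(text)
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    return [t for t in tokens if t not in STOPWORDS]


indexed = analyze("this is an IT:ibm term I want to protect")
query = analyze("IT:ibm")

print(indexed)
print(query)
```

Because the placeholder contains nothing the tokenizer splits on and is not a stopword, it passes through the rest of the chain unchanged on both sides and the query still matches.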
>>         > >> Regards,
>>         > >>   Alex.
>>         > >>
>>         > >> On Tue, 8 Oct 2019 at 12:10, Audrey Lorberfeld -
>>         > >> Audrey.Lorberfeld@ibm.com <Audrey.Lorberfeld@ibm.com> wrote:
>>         > >>>
>>         > >>> Hi All,
>>         > >>>
>>         > >>> This is likely a rudimentary question, but I can’t seem to find a
>>         > >>> straightforward answer on forums or the documentation… is there a way to
>>         > >>> protect tokens from ANY analysis? I know things like the
>>         > >>> KeywordMarkerFilterFactory protect tokens from stemming, but we have some
>>         > >>> terms we don’t even want our tokenizer to touch. Mostly, these are
>>         > >>> IBM-specific acronyms, such as IT:ibm. In this case, we would want to
>>         > >>> maintain the colon and the capitalization (otherwise “it” would be taken
>>         > >>> out as a stopword).
>>         > >>>
>>         > >>> Any advice is appreciated!
>>         > >>>
>>         > >>> Thank you,
>>         > >>> Audrey
>>         > >>>
>>         > >>> --
>>         > >>> Audrey Lorberfeld
>>         > >>> Data Scientist, w3 Search
>>         > >>> IBM
>>         > >>> Audrey.Lorberfeld@IBM.com
>>         > >>>
