lucene-solr-user mailing list archives

From Walter Underwood <wun...@wunderwood.org>
Subject Re: Not highlighting "and" and "or"?
Date Thu, 29 Jun 2017 21:14:16 GMT
My blog post has a list of movie titles. I forgot to list the TV series “Once and Again”.

Some bands that are not searchable with stopwords:

* The Who
* Was (not Was)
* The The
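
A minimal sketch of why names like these vanish, assuming a simple lowercase-and-split analyzer. The `STOPWORDS` set and the `analyze` helper are hypothetical, written for illustration; they are not Lucene's or Solr's actual stopword list or analysis chain.

```python
import re

# Hypothetical stopword list for illustration only; not the default
# list shipped with Lucene or Solr.
STOPWORDS = frozenset(
    {"a", "and", "are", "be", "not", "or", "the", "to", "was", "who"}
)


def analyze(text, stopwords=frozenset()):
    """Lowercase the text, split on non-alphanumerics, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in stopwords]


titles = [
    "The Who",
    "Was (not Was)",
    "The The",
    "Once and Again",
    "to be or not to be",
    "Vitamin A",
]

for title in titles:
    # An empty token list means the title cannot be matched at all.
    print(f"{title!r:25} -> {analyze(title, STOPWORDS)}")
```

With this particular list, the three band names and "to be or not to be" analyze to no tokens at all, and "Vitamin A" collapses to the single token "vitamin", so it can no longer be told apart from any other vitamin.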

wunder
Walter Underwood
wunder@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Jun 29, 2017, at 2:09 PM, Erick Erickson <erickerickson@gmail.com> wrote:
> 
> bq: Mostly, stopwords were a performance hack back when people ran
> search engines on 16-bit machines
> 
> Ah, _those_ were the days when programmers were _real_ programmers.
> Actually I'm glad they're gone but that's another story.
> 
> "to be or not to be". Can't search that if you enable stopwords.
> 
> Chris Hostetter wrote a fun blog on the fact that Lucene query parsers
> are not strict boolean logic with the title "Why Not AND, OR, And NOT"
> purposely choosing that title as it's totally unsearchable if you're
> using stopwords.
> 
> FWIW,
> Erick
> 
> On Thu, Jun 29, 2017 at 1:57 PM, David Hastings
> <hastings.recursive@gmail.com> wrote:
>> Agreed.  Stop words from the moment I started using them caused complaints
>> and problems right off the bat.  They may have been implemented less than a
>> week before needing a re-index to fix all the problems they caused.
>> 
>> On Thu, Jun 29, 2017 at 4:55 PM, Walter Underwood <wunder@wunderwood.org>
>> wrote:
>> 
>>> Ultraseek (and Infoseek) never used stopwords. They cause odd failures,
>>> like not being able to search for “Vitamin A”.
>>> 
>>> Stopwords are an on/off approach to term frequency. idf is a proportional
>>> approach. Once you have idf, you don’t need stopwords.
>>> 
>>> When I was bringing up Solr for Netflix, I started with an analysis chain
>>> that used stopwords. A surprising number of movie titles entirely
>>> disappeared. I wrote a blog post about it. Ten years ago!
>>> 
>>> https://observer.wunderwood.org/2007/05/31/do-all-stopword-queries-matter/
>>> 
>>> Mostly, stopwords were a performance hack back when people ran search
>>> engines on 16-bit machines. Neither disks nor RAM were big enough to hold
>>> the posting lists for common words.
>>> 
>>> wunder
>>> Walter Underwood
>>> wunder@wunderwood.org
>>> http://observer.wunderwood.org/  (my blog)
>>> 
>>> 
>>>> On Jun 29, 2017, at 1:46 PM, Rick Leir <rleir@leirtech.com> wrote:
>>>> 
>>>> Walter
>>>> Sorry for the tangent, but the stopwords feature sounds useful. You say
>>>> you do not use this? Did Ultraseek not do it either?
>>>> Thanks
>>>> Rick
>>>> 
>>>> On June 29, 2017 10:53:42 AM EDT, Walter Underwood <wunder@wunderwood.org> wrote:
>>>>> Nope. Haven’t used stopwords for the last 20 years.
>>>>> 
>>>>> I wonder if lowercaseOperators is true. The docs don’t give the default
>>>>> value for that in edismax.
>>>>> 
>>>>> https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
>>>>> 
>>>>> wunder
>>>>> Walter Underwood
>>>>> wunder@wunderwood.org
>>>>> http://observer.wunderwood.org/  (my blog)
>>>>> 
>>>>> 
>>>>>> On Jun 29, 2017, at 4:42 AM, Rick Leir <rleir@leirtech.com> wrote:
>>>>>> 
>>>>>> Stopwords?
>>>>>> 
>>>>>> On June 28, 2017 5:13:43 PM EDT, Walter Underwood <wunder@wunderwood.org> wrote:
>>>>>>> Is there some special casing in the highlighter to skip query syntax
>>>>>>> words? The words “and” and “or” don’t get highlighted.
>>>>>>> 
>>>>>>> This is in 6.5.0.
>>>>>>> 
>>>>>>>    <str name="hl.fl">question</str>
>>>>>>>    <str name="hl.encoder">html</str>
>>>>>>>    <str name="hl.fragsize">440</str>
>>>>>>>    <str name="hl.method">fastVector</str>
>>>>>>>    <str name="hl.snippets">1</str>
>>>>>>> 
>>>>>>> wunder
>>>>>>> Walter Underwood
>>>>>>> wunder@wunderwood.org
>>>>>>> http://observer.wunderwood.org/  (my blog)
>>>>>> 
>>>>>> --
>>>>>> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>>>> 
>>>> --
>>>> Sorry for being brief. Alternate email is rickleir at yahoo dot com
>>> 
>>> 
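
The point quoted above, that stopwords are an on/off switch while idf is proportional, can be sketched with a few lines of arithmetic. This is an illustration under stated assumptions: the formula is one common smoothed form of inverse document frequency, and the document frequencies and index size are made up for the example, not measured from any real index.

```python
import math


def idf(doc_freq, num_docs):
    """One common smoothed idf: log((N + 1) / (df + 1)) + 1."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1


# Made-up numbers for illustration: a 1,000,000-document index and
# invented document frequencies for four terms.
num_docs = 1_000_000
doc_freqs = {
    "the": 950_000,
    "and": 900_000,
    "vitamin": 1_200,
    "ultraseek": 15,
}

weights = {term: idf(df, num_docs) for term, df in doc_freqs.items()}

for term, weight in weights.items():
    # Very common terms get a small weight, not a weight of zero.
    print(f"{term:10} df={doc_freqs[term]:>9,} idf={weight:.3f}")
```

Under these assumed numbers, "the" and "and" still score, only with far less weight than "vitamin" or "ultraseek", so a query made entirely of common words remains searchable.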

