lucene-solr-user mailing list archives

From Alexandre Rafalovitch <arafa...@gmail.com>
Subject Re: Of, To, and Other Small Words
Date Tue, 15 Jul 2014 01:51:36 GMT
You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilterFactory (slightly different). There are actually a lot
of cool analyzers bundled with Solr. You can find a full list on my site
at: http://www.solr-start.com/info/analyzers
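
As a rough sketch of how that pair might be wired into a schema: the field type name "text_commongrams" and the file name "commonwords.txt" below are hypothetical, and the file is assumed to list the small words (for example "of" and "to") one per line. The index-time filter emits word pairs alongside the single words, and the query-time variant is meant to match against those pairs in phrase queries.

```xml
<!-- Hypothetical field type: the name and the words file are assumptions, not from this thread -->
<fieldType name="text_commongrams" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index side: keeps single words and adds pairs that include a common word -->
    <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query side: the slightly different variant intended for phrase queries -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Pasting a phrase such as "knowledge of science" into the Admin UI's Analysis screen for this field type should show which tokens each side produces, without reindexing.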

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


On Tue, Jul 15, 2014 at 8:42 AM, Teague James <teaguej@insystechinc.com> wrote:
> Alex,
>
> Thanks! Great suggestion. I figured out that it was the EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
>
> -----Original Message-----
> From: Alexandre Rafalovitch [mailto:arafalov@gmail.com]
> Sent: Monday, July 14, 2014 9:14 PM
> To: solr-user
> Subject: Re: Of, To, and Other Small Words
>
> Have you tried the Admin UI's Analysis screen? It will show you what happens to the text as it progresses through the tokenizers and filters. No need to reindex.
>
> Regards,
>    Alex.
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
>
>
> On Tue, Jul 15, 2014 at 8:10 AM, Teague James <teaguej@insystechinc.com> wrote:
>> Hi Anshum,
>>
>> Thanks for replying and suggesting this, but the field type I am using (a modified text_general) in my schema has the file set to 'stopwords.txt'.
>>
>>         <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>>                 <analyzer type="index">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <!-- in this example, we will only use synonyms at query time
>>                         <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>-->
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The NGramFilterFactory was added to provide partial word search. This can be changed to
>>                         EdgeNGramFilterFactory side="front" to only match front-sided partial searches if matching any
>>                         part of a word is undesirable.-->
>>                         <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="10" />
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>                 <analyzer type="query">
>>                         <tokenizer class="solr.StandardTokenizerFactory"/>
>>                         <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
>>                         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>                         <filter class="solr.LowerCaseFilterFactory"/>
>>                         <!-- CHANGE: The PorterStemFilterFactory was added to allow matches for 'cat' and 'cats' by searching for 'cat' -->
>>                         <filter class="solr.PorterStemFilterFactory"/>
>>                 </analyzer>
>>         </fieldType>
>>
>> Just to be double sure, I cleared the list in stopwords_en.txt, restarted Solr, re-indexed, and searched, still with zero results. Any other suggestions on where I might be able to control this behavior?
>>
>> -Teague
>>
>>
>> -----Original Message-----
>> From: Anshum Gupta [mailto:anshum@anshumgupta.net]
>> Sent: Monday, July 14, 2014 4:04 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Of, To, and Other Small Words
>>
>> Hi Teague,
>>
>> The StopFilterFactory (which I think you're using) by default uses lang/stopwords_en.txt (which wouldn't be empty if you check).
>> What you're looking at is stopwords.txt. You could either empty that file out or change the field type for your field.
>>
>>
>> On Mon, Jul 14, 2014 at 12:53 PM, Teague James <teaguej@insystechinc.com> wrote:
>>> Hello all,
>>>
>>> I am working with Solr 4.9.0 and am searching for phrases that
>>> contain words like "of" or "to" that Solr seems to be ignoring at index time.
>>> Here's what I tried:
>>>
>>> curl "http://localhost/solr/update?commit=true" -H "Content-Type: text/xml" \
>>>   --data-binary '<add><doc><field name="id">100</field><field name="content">blah blah blah knowledge of science blah blah blah</field></doc></add>'
>>>
>>> Then, using a browser:
>>>
>>> http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100
>>>
>>> I get zero hits. Search for "knowledge" or "science" and I'll get hits.
>>> "knowledge of" or "of science" and I get zero hits. I don't want to
>>> use proximity if I can avoid it, as this may introduce too many
>>> undesirable results. Stopwords.txt is blank, yet clearly Solr is ignoring "of" and "to"
>>> and possibly more words that I have not discovered through testing
>>> yet. Is there some other configuration file that contains these small
>>> words? Is there any way to force Solr to pay attention to them and
>>> not drop them from the phrase? Any advice is appreciated! Thanks!
>>>
>>> -Teague
>>>
>>>
>>
>>
>>
>> --
>>
>> Anshum Gupta
>> http://www.anshumgupta.net
>>
>
