lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Searching for the '+' character
Date Mon, 14 Sep 2009 16:01:25 GMT
Before you go too much further with this, I've just got to ask whether the
use case for searching "product+" really serves your customers.
If you mess around with analyzers to make things include the "+",
what does that mean for "&"? "*"? "."? any other weird character
you can think of?

Would it be a bad thing for "product" to match "product+" and vice
versa? Would it be more or less confusing for your users to have "product"
FAIL to match "product+"?

Of course only you really know your problem space, but think carefully
about this issue before you take on the work of making "product+" work
because it'll inevitably be waaaay more work than you think. Imagine the
bug reports when "product&" fails to match "product+", both of which
fail to match "product"....

I'd also get a copy of Luke and look at the index to be sure that what you
think is in there is *actually* there. It'll also help you understand better
what the analyzers do.

Don't forget that using different analyzers when indexing and querying will
lead to "interesting" results.


On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <> wrote:

> Thanks Ahmet,
> That's excellent, thanks :) I may have to increase the gram size to take into
> account other possible uses, but I can now read around these filters to make
> the adjustments.
> With regard to WordDelimiterFilterFactory: is there a way to configure this
> filter so that I still get most of its functionality without it absorbing
> the + signs? Will I lose a lot of 'good' functionality by removing it?
> 'preserveOriginal' sounds promising and seems to work, but is it a good idea
> to use this?
> On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:
>> --- On Mon, 9/14/09, Paul Forsyth <> wrote:
>>> From: Paul Forsyth <>
>>> Subject: Re: Searching for the '+' character
>>> To:
>>> Date: Monday, September 14, 2009, 5:55 PM
>>> With words like 'product+' i'd expect
>>> a search for '+' to return results like any other character
>>> or word, so '+' would be found within 'product+' or similar
>>> text.
>>> I've tried removing the worddelimiter from the query
>>> analyzer, restarting and reindexing but i get the same
>>> result. Nothing is found. I assume one of the filters could
>>> be adjusted to keep the '+'.
>>> Weird thing is that i tried to remove all filters from the
>>> analyzer and i get the same result.
>>> Paul
>> When you remove all filters, '+' is kept, but '+' still won't match
>> 'product+', because you want to search inside a token.
>> If the + sign is always at the end of your text, and you only want to search
>> on the last character of your text, EdgeNGramFilterFactory can do that
>> with the settings side="back" maxGramSize="1" minGramSize="1".
>> The fieldType below will match '+' to 'product+':
>> <fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
>>     <analyzer type="index">
>>       <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"/>
>>        <filter class="solr.EdgeNGramFilterFactory" side="back"
>> maxGramSize="1" minGramSize="1"/>
>>     </analyzer>
>>     <analyzer type="query">
>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>> ignoreCase="true" expand="true"/>
>>       <filter class="solr.LowerCaseFilterFactory"/>
>>       <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>       <filter class="solr.SnowballPorterFilterFactory"
>> language="English"/>
>>     </analyzer>
>>   </fieldType>
>> But this time 'product+' will be reduced to only '+'. You won't be able to
>> search for it in other ways, for example product*. If you want to keep the
>> original word itself along with the last character, you can set maxGramSize
>> to 512. With that setting the token 'product+' will produce 8 tokens (and the
>> query product* or product+ will return it):
>> + word
>> t+ word
>> ct+ word
>> uct+ word
>> duct+ word
>> oduct+ word
>> roduct+ word
>> product+ word
>> If the + sign can be anywhere inside the text, you can use NGramTokenFilter
>> (solr.NGramFilterFactory in schema.xml).
>> Hope this helps.
> Best regards,
> Paul Forsyth
> mail:
> skype: paulforsyth
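For reference, the maxGramSize change suggested in the quoted reply could look roughly like the field type below. This is only a sketch: the name "textx_backgram" is made up for illustration, the filter chain is copied from the quoted example, and the only intended change is maxGramSize="512" on the index side. The NGramFilterFactory line in the comment uses assumed gram sizes. None of this has been tested here, so check the resulting tokens in Luke or the analysis page before relying on it.

```xml
<!-- Hypothetical field type name; filter chain taken from the quoted example. -->
<fieldType name="textx_backgram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
    <!-- Back edge n-grams from one character up to the whole token:
         'product+' yields '+', 't+', 'ct+', and so on up to 'product+'. -->
    <filter class="solr.EdgeNGramFilterFactory" side="back"
            minGramSize="1" maxGramSize="512"/>
    <!-- Possible alternative if '+' can appear anywhere in a token
         (gram sizes here are assumptions, not recommendations):
         <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="8"/> -->
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter on the query side: the query term is matched
         against the grams that were written at index time. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Keep in mind that the Lucene query parser treats a bare + as an operator, so the query itself would most likely need the escaped form \+ to reach the analyzer at all. Indexing every suffix of every token will also grow the index, which is part of the trade-off raised at the top of this message.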
