lucene-solr-user mailing list archives

From Paul Forsyth ...@ez.no>
Subject Re: Searching for the '+' character
Date Mon, 14 Sep 2009 16:08:09 GMT
Hi Erick,

In this specific case my client does have a new product with a '+' at
the end. It's just one of those odd ones!

Customers are expected to put + into the search box, so I have to have
results to show.

I hear your concerns though. Originally I thought I would need to
transform the + into something else, and do this back and forth to
get a match!

Hopefully this will be a standard Solr install, but with this tweak
for escaped chars...

Paul

On 14 Sep 2009, at 17:01, Erick Erickson wrote:

> Before you go too much further with this, I've just got to ask whether
> the use case for searching "product+" really serves your customers.
> If you mess around with analyzers to make things include the "+",
> what does that mean for "&"? "*"? "."? Any other weird character
> you can think of?
>
> Would it be a bad thing for "product" to match "product+" and vice
> versa? Would it be more or less confusing for your users to have
> "product" FAIL to match "product+"?
>
> Of course only you really know your problem space, but think carefully
> about this issue before you take on the work of making "product+" work,
> because it'll inevitably be waaaay more work than you think. Imagine
> the bug reports when "product&" fails to match "product+", both of
> which fail to match "product"....
>
> I'd also get a copy of Luke and look at the index to be sure what you
> *think* is in there is *actually* there. It'll also help you better
> understand what analyzers do.
>
> Don't forget that using different analyzers when indexing and querying
> will lead to...er..."interesting" results.
>
> Best
> Erick
>
> On Mon, Sep 14, 2009 at 11:38 AM, Paul Forsyth <pf@ez.no> wrote:
>
>> Thanks Ahmet,
>>
>> That's excellent, thanks :) I may have to increase the gram size to
>> take into account other possible uses, but I can now read around these
>> filters to make the adjustments.
>>
>> With regard to WordDelimiterFilterFactory: is there a way to place a
>> delimiter on this filter so that I still get most of its functionality
>> without it absorbing the + signs? Will I lose a lot of 'good'
>> functionality by removing it? 'preserveOriginal' sounds promising and
>> seems to work, but is it a good idea to use this?
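
A minimal sketch of the 'preserveOriginal' setup asked about above, assuming a whitespace-tokenized field. The flag values are illustrative assumptions, not settings taken from this thread. The idea is that with preserveOriginal="1" the filter should emit the untouched token 'product+' alongside the split part 'product':

```xml
<!-- Hypothetical filter line; the flag values are assumptions for illustration -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1" catenateAll="0"
        preserveOriginal="1"/>
```

Running a sample value through the Solr analysis page, or inspecting the index with Luke, is probably the quickest way to check whether 'product+' really survives as a token.
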
>>
>>
>> On 14 Sep 2009, at 16:16, AHMET ARSLAN wrote:
>>
>>
>>>
>>> --- On Mon, 9/14/09, Paul Forsyth <pf@ez.no> wrote:
>>>
>>>> From: Paul Forsyth <pf@ez.no>
>>>> Subject: Re: Searching for the '+' character
>>>> To: solr-user@lucene.apache.org
>>>> Date: Monday, September 14, 2009, 5:55 PM
>>>> With words like 'product+' I'd expect
>>>> a search for '+' to return results like any other character
>>>> or word, so '+' would be found within 'product+' or similar
>>>> text.
>>>>
>>>> I've tried removing the word delimiter filter from the query
>>>> analyzer, restarting and reindexing, but I get the same
>>>> result. Nothing is found. I assume one of the filters could
>>>> be adjusted to keep the '+'.
>>>>
>>>> Weird thing is that I tried to remove all filters from the
>>>> analyzer and I get the same result.
>>>>
>>>> Paul
>>>>
>>>
>>> When you remove all filters, '+' is kept, but '+' still won't match
>>> 'product+', because you are trying to search inside a token.
>>>
>>> If the + sign is always at the end of your text, and you want to
>>> search only on the last character of your text, EdgeNGramFilterFactory
>>> can do that with the settings side="back" maxGramSize="1"
>>> minGramSize="1".
>>>
>>> The fieldType below will match '+' to 'product+':
>>>
>>> <fieldType name="textx" class="solr.TextField" positionIncrementGap="100">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>>>     <filter class="solr.EdgeNGramFilterFactory" side="back" maxGramSize="1" minGramSize="1"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.ISOLatin1AccentFilterFactory"/>
>>>     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>>
>>> But this time 'product+' will be reduced to only '+'. You won't be
>>> able to search for it in other ways, for example with product*. If you
>>> want to keep the original word itself along with the last character,
>>> you can set maxGramSize to 512. By doing this, the token 'product+'
>>> will produce 8 tokens (and the query product* or product+ will
>>> return it):
>>>
>>> + word
>>> t+ word
>>> ct+ word
>>> uct+ word
>>> duct+ word
>>> oduct+ word
>>> roduct+ word
>>> product+ word
>>>
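
For reference, a sketch of the one line that changes for the variant described above, assuming the rest of the "textx" index analyzer stays as posted. Only maxGramSize differs, and the query analyzer is left without the n-gram filter:

```xml
<!-- Index analyzer only: emit every suffix of a token, from its last
     character up to the whole token (up to 512 characters) -->
<filter class="solr.EdgeNGramFilterFactory" side="back"
        minGramSize="1" maxGramSize="512"/>
```

This trades a larger index for being able to match both '+' and the full 'product+'.
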
>>> If the + sign can be anywhere inside the text, you can use
>>> NGramTokenFilter.
>>> Hope this helps.
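
A hedged sketch of the NGram alternative mentioned above, as it might appear in the index analyzer in place of the edge n-gram filter. The gram sizes are assumptions chosen for illustration, not values from this thread; minGramSize="1" is what lets a single '+' match anywhere in a token:

```xml
<!-- Hypothetical sizes: every substring of 1 to 15 characters becomes a token -->
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="15"/>
```

Since every substring is indexed, the index grows quickly as maxGramSize increases, so it is probably worth keeping that value as small as the use case allows.
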
>>>
>>>
>>>
>>>
>> Best regards,
>>
>> Paul Forsyth
>>
>> mail: pf@ez.no
>> skype: paulforsyth
>>
>>

Best regards,

Paul Forsyth

mail: pf@ez.no
skype: paulforsyth

