lucene-solr-user mailing list archives

From Emir Arnautović <emir.arnauto...@sematext.com>
Subject Re: Advice on Stemming in Solr
Date Fri, 03 Nov 2017 08:24:10 GMT
Hi Edwin,
Hunspell is a configurable, language-independent library, and you can define any morphology rules.
It’s been around for a while, and I would not be surprised if someone has already adjusted the English
rules to suit your case.
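
As a minimal sketch of what wiring Hunspell into a field type might look like: the field type name and the file names en_custom.dic and en_custom.aff are placeholders, not files that ship with Solr, and the rule changes themselves would still have to be made inside those two files.

```xml
<!-- Hypothetical field type; the name and both file names are assumptions. -->
<fieldType name="text_hunspell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- dictionary: words plus flags; affix: the rules those flags refer to -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_custom.dic"
            affix="en_custom.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

Both files are loaded from the collection's configuration directory, so an adjusted English rule set can be tried by swapping the files and reindexing.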

Thanks,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 3 Nov 2017, at 04:25, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
> 
> Hi Emir,
> 
> We are looking to change to HunspellStemFilterFactory. This has a
> dictionary file containing words and applicable flags, and an affix file
> that specifies how these flags will control spell checking.
> Probably we can control it from those files in HunspellStemFilterFactory?
> 
> Regards,
> Edwin
> 
> 
> On 2 November 2017 at 17:46, Emir Arnautović <emir.arnautovic@sematext.com>
> wrote:
> 
>> Hi Edwin,
>> It seems that it would be best if you do not apply the *ing stemming rule at
>> all. The first idea is to trick the stemmer by replacing the "ing" ending of any
>> word with some nonexistent character combination, e.g. ‘wqx’. You can use solr.PatternReplaceFilterFactory
>> to do that. You can switch it back after stemming if you want to have the proper
>> token in the index.
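
The masking trick above could be sketched as the analyzer chain below. This is an untested illustration: the field type name is made up, and it assumes KStem leaves tokens ending in "wqx" unchanged.

```xml
<!-- Hypothetical field type name; 'wqx' is the placeholder suggested above. -->
<fieldType name="text_en_keep_ing" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Hide a trailing "ing" from the stemmer -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="ing$" replacement="wqx"/>
    <filter class="solr.KStemFilterFactory"/>
    <!-- Restore the original ending after stemming -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="wqx$" replacement="ing"/>
  </analyzer>
</fieldType>
```

Note that this switches off "ing" stemming for every word, English ones included, so "walking" would no longer match "walk" on this field.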
>> 
>> HTH,
>> Emir
>> 
>> 
>> 
>>> On 2 Nov 2017, at 03:23, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>> 
>>> Hi Emir,
>>> 
>>> We do have quite a lot of words that should not be stemmed. Currently,
>>> KStemFilterFactory is also stemming all the non-English words that end with
>>> "ing". There are quite a lot of places and names which end in
>>> "ing", and all of these are being stemmed as well, which leads to
>>> inaccurate search results.
>>> 
>>> Regards,
>>> Edwin
>>> 
>>> 
>>> On 1 November 2017 at 18:20, Emir Arnautović <emir.arnautovic@sematext.com>
>>> wrote:
>>> 
>>>> Hi Edwin,
>>>> If the number of words that should not be stemmed is not high, you could
>>>> use KeywordMarkerFilterFactory to flag those words as keywords, which
>>>> should prevent the stemmer from changing them.
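
A minimal sketch of the keyword-marker approach, assuming the protected words are kept in a file named protwords.txt (one term per line, e.g. ximenting) in the collection's configuration directory:

```xml
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Must come before the stemmer; terms listed in the file are flagged as keywords -->
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
  <filter class="solr.KStemFilterFactory"/>
</analyzer>
```

Because lowercasing runs first in this chain, the entries in the file would need to be lowercase as well.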
>>>> Depending on what you want to achieve, you might not be able to avoid
>>>> using a stemmer at indexing time. If you want to find documents that contain
>>>> only “walking” with the search term “walk”, then you have to stem at index
>>>> time. Cases where you use stemming at query time only are rare and specific.
>>>> If you want to prefer exact matches over stemmed matches, you have to
>>>> index the same content with and without stemming and boost matches on the field
>>>> without stemming.
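
One way the two-field setup might look in the schema, as a sketch only: the field names and the field types text_en_stem and text_en_exact are hypothetical and assumed to be defined elsewhere, with and without a stem filter respectively.

```xml
<!-- Stemmed copy of the content (hypothetical names throughout) -->
<field name="content" type="text_en_stem" indexed="true" stored="true"/>
<!-- Unstemmed copy, used only to boost exact matches -->
<field name="content_exact" type="text_en_exact" indexed="true" stored="false"/>
<copyField source="content" dest="content_exact"/>
```

A query could then weight the unstemmed field higher, for example with the edismax parser and qf set to "content_exact^2 content"; the boost value of 2 is arbitrary and would need tuning.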
>>>> 
>>>> HTH,
>>>> Emir
>>>> 
>>>> 
>>>> 
>>>>> On 1 Nov 2017, at 10:11, Zheng Lin Edwin Yeo <edwinyeozl@gmail.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> We are currently using KStemFilterFactory in Solr, but we found that it is
>>>>> actually stemming non-English words like "ximenting", which it
>>>>> stems to "ximent". This is not what we want.
>>>>> 
>>>>> Another option is to use HunspellStemFilterFactory, but there are some
>>>>> English words like "running" and "walking" that are not being stemmed.
>>>>> 
>>>>> We would like to check: is it advisable to use stemming at index time? Or should
>>>>> we not stem at index time but instead, at query time, also search for the
>>>>> stemmed word? For example, if the user searches for "walking",
>>>>> we would search for it together with "walk", and the actual word "walking"
>>>>> would have a higher weight.
>>>>> 
>>>>> I'm currently using Solr 6.5.1.
>>>>> 
>>>>> Regards,
>>>>> Edwin
>>>> 
>>>> 
>> 
>> 

