lucene-solr-user mailing list archives

From Utkarsh Sengar <utkarsh2...@gmail.com>
Subject Re: What filter to use to search with spaces omitted/included between words?
Date Tue, 20 Aug 2013 23:48:15 GMT
Thanks Tamanjit and Erick.
I tried out the filters, and most of the use cases work except "q=bestbuy". As
mentioned by Erick, that is a hard one to crack.

I am looking into DictionaryCompoundWordTokenFilterFactory, but a dictionary
of compound words like these:
http://www.manythings.org/vocabulary/lists/a/words.php?f=compound_words and
generic English words won't cover my need for custom compound words, such as
store names like BestBuy, WalMart or CircuitCity.
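
For reference, a custom dictionary does not have to be a general English word list. The fragment below is only a sketch of how the filter might be wired up with a hand-maintained list of store-name parts. The field type name "text_store" and the file name "store-words.txt" are hypothetical, and the size limits are assumptions chosen so that short parts like "buy" and "wal" are accepted.

```xml
<!-- Sketch only: "text_store" and "store-words.txt" are hypothetical names. -->
<fieldType name="text_store" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase first so tokens match a lowercase dictionary. -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- store-words.txt: one part per line, e.g. best, buy, wal, mart, circuit, city -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="store-words.txt"
            minWordSize="5" minSubwordSize="3" maxSubwordSize="15"
            onlyLongestMatch="false"/>
  </analyzer>
</fieldType>
```

With a setup like this, the filter keeps the original token and adds any dictionary parts it finds inside it, so "bestbuy" would also yield "best" and "buy". The trade-off is that unrelated words containing a dictionary entry can be split too, so the list probably needs to stay small and specific.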

Thanks,
-Utkarsh


On Tue, Aug 20, 2013 at 4:43 AM, Jack Krupansky <jack@basetechnology.com> wrote:

> You could either have a synonym filter to replace "bestbuy" with "best
> buy" or use DictionaryCompoundWordTokenFilterFactory to do the same.
>
> See:
> http://lucene.apache.org/core/4_4_0/analyzers-common/org/apache/lucene/analysis/compound/DictionaryCompoundWordTokenFilterFactory.html
>
> There are some examples in my book, but they are for German compound words
> since that was the original primary intent for this filter. But it should
> work for any words since it is a simple dictionary.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Erick Erickson
> Sent: Tuesday, August 20, 2013 7:21 AM
> To: solr-user@lucene.apache.org
> Subject: Re: What filter to use to search with spaces omitted/included
> between words?
>
>
> Also consider WordDelimiterFilterFactory, which will break up the
> tokens on upper/lower case transitions.
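
As an illustration of the WordDelimiterFilterFactory suggestion above, here is a minimal sketch of an analyzer chain. The field type name "text_wdf" is hypothetical, and the choice of flags is an assumption about what would suit store names, not a recommended configuration.

```xml
<!-- Sketch only: "text_wdf" is a hypothetical field type name. -->
<fieldType name="text_wdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenizer keeps "BestBuy" intact so the filter can see the case change. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" splitOnCaseChange="1"
            catenateWords="1" preserveOriginal="1"/>
    <!-- Lowercase after splitting, otherwise the case transition is lost. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With these flags, "BestBuy" should produce "best" and "buy" as well as "bestbuy". An all-lowercase "bestbuy" has no case transition, so it passes through as a single token, which is the limitation described further down.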
>
> To get relevance, consider edismax-style query parsers and adding
> automatic phrase generation (with boosts usually).
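
One way the edismax phrase boosting mentioned above might be set up is through request handler defaults. This is a sketch under assumptions: the field name "name", the boost of 10 and the slop of 1 are all hypothetical values picked for illustration.

```xml
<!-- Sketch only: the field "name" and the boost/slop values are hypothetical. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Fields searched term by term. -->
    <str name="qf">name</str>
    <!-- Documents where the query terms also match as a phrase get an extra boost. -->
    <str name="pf">name^10</str>
    <str name="ps">1</str>
  </lst>
</requestHandler>
```

Here qf controls which fields are matched and pf only affects ranking, so a query for "best buy" would still match documents containing either word but would rank the adjacent phrase higher.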
>
> This one will be a problem:
> q=bestbuy
>
> There's no good generic way to get this to split up. One
> possibility is to use synonyms if the list is known, but
> otherwise there's no information here to distinguish it
> from "legitimate" words.
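
For the known-list case described above, a synonym setup could look roughly like the following. The entries shown are hypothetical examples based on the store names in this thread, and whether to apply the filter at index time, query time or both is left open here.

```xml
<!-- Sketch only. synonyms.txt (hypothetical entries), one rule per line:
       bestbuy => best buy
       walmart => wal mart
       circuitcity => circuit city
-->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
        ignoreCase="true" expand="true"/>
```

The filter element goes inside an analyzer, after the tokenizer. Each rule rewrites the single token on the left into the two tokens on the right, so q=bestbuy would be analyzed as "best" followed by "buy".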
>
> edgeNgrams work on _tokens_, not words, so I doubt
> they would help in this case either since there is only
> one token.
>
> Best
> Erick
>
>
> On Tue, Aug 20, 2013 at 3:16 AM, tamanjit.bindra@yahoo.co.in <
> tamanjit.bindra@yahoo.co.in> wrote:
>
>> Additionally, if you don't want results like q=best and result=bestbuy, you
>> can use <charFilter class="solr.PatternReplaceCharFilterFactory"
>> pattern="\W+" replacement=""/> to actually replace whitespace (and any
>> other non-word characters) with nothing.
>>
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#CharFilterFactories
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/What-filter-to-use-to-search-with-spaces-omitted-included-between-words-tp4085576p4085601.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>


