lucene-solr-user mailing list archives

From Dmitry Kan <dmitry....@gmail.com>
Subject Re: Searching on fields with White Spaces
Date Wed, 25 Apr 2012 12:25:15 GMT
The problem here is that e.g. New York is stored as two separate tokens in
your index, because you use the whitespace tokenizer. The easiest solution
would be to detect one-word query tokens such as newyork and break them into
several tokens, i.e. newyork => new york. That's probably feasible only if
there is a finite and known list of such cases (is there?)
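
If the list is finite, one way to do the splitting inside Solr is a
query-time synonym mapping. This is only an untested sketch: the file name
compound_places.txt and its two entries are made up for illustration, and the
synonym filter is simply slotted into the query analyzer quoted below.

```xml
<!-- Hypothetical compound_places.txt, one explicit mapping per line:
       newyork => new york
       kualalumpur => kuala lumpur
-->
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Assumption: rewrite known one-word forms before n-gramming. -->
  <filter class="solr.SynonymFilterFactory" synonyms="compound_places.txt"
          ignoreCase="true"/>
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
          maxGramSize="10"/>
</analyzer>
```

With such a mapping, newyork should leave the synonym filter as the two
tokens new and york before the edge n-gram filter runs, so it lines up with
what the index analyzer produced for New York.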

Another way is to use ShingleFilterFactory, which combines adjacent tokens
into word n-grams (shingles). Again, that needs its own analysis chain rather
than your current whitespace-tokenized one (you can make a copy field and
experiment via the analysis page).

Here is what I did for a quick experiment:

<fieldType name="shingle_text_fivegram" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="5"
            outputUnigrams="true" tokenSeparator=""/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

...

<field name="shingleContent_fivegram" type="shingle_text_fivegram"
       indexed="true" stored="true" omitNorms="true"
       omitTermFreqAndPositions="true"/>


Notice tokenSeparator (available as of Solr 3.1, see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ShingleFilterFactory).
For new york the following tokens get produced:

org.apache.solr.analysis.ShingleFilterFactory {outputUnigrams=true,
maxShingleSize=5, tokenSeparator=, luceneMatchVersion=LUCENE_34}

term text   position   startOffset   endOffset   type
new         1          0             3           word
york        2          4             8           word
newyork     1          0             8           shingle

So it should meet both requirements: finding new york and newyork.
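
To try this next to the existing field, a copyField can feed the shingle
field from whatever field holds the original values. The source name city
below is a placeholder, since the real field name is not given in this
thread.

```xml
<!-- Assumption: "city" stands in for the real source field name. -->
<copyField source="city" dest="shingleContent_fivegram"/>
```

Queries for newyork or new york would then go against
shingleContent_fivegram rather than the original field.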

Test it for your case and see whether it is acceptable for both speed and index size.

Dmitry


On Tue, Apr 24, 2012 at 3:39 PM, Shubham Srivastava <
Shubham.Srivastava@makemytrip.com> wrote:

> I have a custom fieldtype with the below config
>
> <fieldType name="text_ngram" class="solr.TextField"
> positionIncrementGap="100" autoGeneratePhraseQueries="true">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt" enablePositionIncrements="true" />
>        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="10" />
> <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone"
> inject="true"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> maxGramSize="10" />
>      </analyzer>
>    </fieldType>
>
>
> I have autocomplete configured on the same field, which gives me results
> as expected. A new use case is to search for kualalumpur or, say, newyork
> without spaces and get back Kuala Lumpur and New York, which happen to be
> the original values.
>
> What would be the recommended solution?
>
> Regards,
> Shubham
>
>
>
