lucene-solr-user mailing list archives

From Alex Willmer <al.will...@logica.com>
Subject Re: StandardTokenizer and domain names containing digits
Date Mon, 23 Apr 2012 09:34:54 GMT
Steven A Rowe <sarowe <at> syr.edu> writes:
> StandardTokenizer in Lucene/Solr v3.1+ implements the Word Boundary rules from
> Unicode 6.0.0 Standard Annex #29, a.k.a. UAX#29:
> <http://www.unicode.org/reports/tr29/tr29-17.html#Word_Boundaries>.
> These rules don't include recognition of URLs or domain names.
> 
> Lucene/Solr includes another tokenizer that does recognize URLs and domain
> names, in addition to the UAX#29 Word Boundary rules: UAX29URLEmailTokenizer
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory>.
> (Stand-alone domain names are recognized as URLs.)
> 
> My suggestion is that you add a filter (for both the indexing and querying)
> that splits tokens containing periods:
> <http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory>,
> something like (untested!):
> 
>     <filter class="solr.WordDelimiterFilterFactory"
>             splitOnCaseChange="0"
>             splitOnNumerics="0"
>             stemEnglishPossessive="0"
>             generateWordParts="1"
>             preserveOriginal="1" />

Steve, thank you very much for this reply; it helped immensely. In the end I've
gone for your suggestion, plus swapping StandardTokenizer for
UAX29URLEmailTokenizer and setting autoGeneratePhraseQueries="true". The
fieldType now looks like this:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" 
autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" 
            expand="false"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            splitOnCaseChange="1"
            splitOnNumerics="0"
            stemEnglishPossessive="0"
            generateWordParts="1"
            preserveOriginal="1" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" 
            words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" 
            synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

autoGeneratePhraseQueries is set so that the tokens generated in the query
analyzer behave more like the terms of a space-delimited query. So
"ns1.define.logica.com" finds a similar set of documents to "ns1 define logica
com" (i.e. "ns1 AND define AND logica AND com"), rather than "ns1 OR define OR
logica OR com".
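
To make the AND versus OR difference concrete, here is a toy Python sketch. It
is not Solr code: the function names and the two sample documents are made up
for illustration, the splitter only roughly imitates WordDelimiterFilter with
splitOnNumerics="0" (it ignores preserveOriginal and case changes), and token
positions are ignored, so it does not model real phrase matching.

```python
import re

def split_parts(text):
    """Hypothetical stand-in for the analyzer: split on whitespace, then on
    any non-alphanumeric character, and lowercase. Digits stay attached to
    letters, so "ns1" remains one token (as with splitOnNumerics="0")."""
    tokens = []
    for chunk in text.split():
        tokens.extend(p.lower() for p in re.split(r"[^0-9A-Za-z]+", chunk) if p)
    return tokens

def matches_all(query_tokens, doc_tokens):
    """AND-style match: every query token must occur in the document."""
    return all(t in doc_tokens for t in query_tokens)

def matches_any(query_tokens, doc_tokens):
    """OR-style match: one shared token is enough."""
    return any(t in doc_tokens for t in query_tokens)

# Made-up sample documents, for illustration only.
doc_a = set(split_parts("ns1.define.logica.com is the primary name server"))
doc_b = set(split_parts("please define the logica com interface"))

query = split_parts("ns1.define.logica.com")  # ns1, define, logica, com

print(matches_all(query, doc_a), matches_all(query, doc_b))
print(matches_any(query, doc_a), matches_any(query, doc_b))
```

In this sketch the OR-style match also returns the second document, which only
shares "define", "logica" and "com" with the query; the AND-style match keeps
just the document that actually contains the host name.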

Many thanks, Alex

