lucene-solr-user mailing list archives

From Nawab Zada Asad Iqbal <khi...@gmail.com>
Subject Small Tokenization issue
Date Wed, 03 Jan 2018 20:04:07 GMT
Hi,

I have a string to index:

abc - def (note the space on either side of the hyphen)

It is processed by the following filter chain:


    <fieldType name="shingle" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="org.apache.lucene.analysis.icu.ICUNormalizer2CharFilterFactory"
                    name="nfkc" mode="compose"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterGraphFilterFactory"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="0" catenateAll="0"
                preserveOriginal="0" splitOnCaseChange="1" splitOnNumerics="1"
                stemEnglishPossessive="0"/>
        <filter class="solr.FlattenGraphFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
                outputUnigrams="false" fillerToken=""/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.LimitTokenCountFilterFactory"
                maxTokenCount="10000" consumeAllTokens="false"/>
        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
      </analyzer>


At the end of the chain I get two separate tokens:

"abc" "def"

I want to get a single shingle, "abc def". What can I tweak to get this?
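
One possible direction, sketched under an assumption that is not confirmed here: the lone "-" is emitted as its own token by the whitespace tokenizer, is then dropped by WordDelimiterGraphFilterFactory, and leaves a position gap that ShingleFilterFactory fills with the (empty) fillerToken instead of joining "abc" and "def". If that is the cause, removing free-standing punctuation before tokenization should avoid the gap. The charFilter below is a hypothetical addition to the chain above and has not been tested against it; PatternReplaceCharFilterFactory is a standard Solr char filter, but the pattern is only an illustration.

```xml
<!-- Hypothetical addition, placed before the tokenizer: collapse a run of
     punctuation that stands alone between whitespace into a single space,
     so "abc - def" reaches the tokenizer as "abc def". -->
<charFilter class="solr.PatternReplaceCharFilterFactory"
            pattern="\s+\p{Punct}+\s+" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
```

This pattern only covers punctuation sitting between two words. A punctuation run at the very start or end of the input would not match and would need its own handling.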


Thanks
Nawab
