lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From lee carroll <lee.a.carr...@googlemail.com>
Subject keepword file with phrases
Date Sat, 05 Feb 2011 16:08:49 GMT
Hi List
I'm trying to achieve the following

text in "this aisle contains preserves and savoury spreads"

desired index entry for a field to be used for faceting (ie strict set of
normalised terms)
is "jams" "savoury spreads" ie two facet terms

current set up for the field is

<fieldType name="facet" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
outputUnigrams="true"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.KeepWordFilterFactory"
words="goodForKeepWords.txt" ignoreCase="true"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
outputUnigrams="true"/>
        <filter class="solr.SynonymFilterFactory"
synonyms="goodForSynonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.KeepWordFilterFactory"
words="goodForKeepWords.txt" ignoreCase="true"/>
      </analyzer>
    </fieldType>

The thinking here is
get rid of any mark up nonsense
split into tokens based on whitespace => "this" "aisle" "contains"
"preserves" "and" "savoury" "spreads"
produce shingles of 1 or 2 tokens => "this","this aisle", "aisle", "aisle
contains", "contains", "contains preserves","preserves","and",
                                                      "and savoury",
"savoury", "savoury spreads", "spreads"

expand synonyms using a synomym file (preserves -> jam) =>

"this","this aisle", "aisle", "aisle contains", "contains","contains
preserves","preserves","jam","and","and savoury", "savoury", "savoury
spreads", "spreads"

produce a normalised term list using a keepword file of jam , "savoury
spreads" in it

which should place "jam" "savoury spreads" into the index field facet.....

However i don't get savoury spreads in the index. from the analysis.jsp
everything goes to plan upto the last step where the keepword file does not
like keeping the phrase "savoury spreads". i've tried niavely quoting the
phrase in the keepword file :-)

What is the best way to achive the above ? Is this the correct approach or
is there a better way ?

thanks in advance lee

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message