lucene-solr-dev mailing list archives

From "Koji Sekiguchi (JIRA)" <j...@apache.org>
Subject [jira] Created: (SOLR-319) changes SynonymFilterFactory for N-gram tokenizer
Date Thu, 26 Jul 2007 03:04:31 GMT
changes SynonymFilterFactory for N-gram tokenizer
-------------------------------------------------

                 Key: SOLR-319
                 URL: https://issues.apache.org/jira/browse/SOLR-319
             Project: Solr
          Issue Type: Improvement
            Reporter: Koji Sekiguchi
            Priority: Minor


WHAT:
Currently, SynonymFilterFactory works well with N-gram tokenizers (CJKTokenizer, for example),
but the rules in synonyms.txt have to be written with care.
For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for CJK characters) and
want C1C2C3 to map to C4C5C6, I have to write the rule as follows:

C1C2 C2C3 => C4C5 C5C6

I would rather write it as "C1C2C3=>C4C5C6", and this patch allows that. It also makes
synonyms.txt easier to share.

HOW:
A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
If the attribute is specified, SynonymFilterFactory uses the named TokenizerFactory to create
a Tokenizer, and then uses that Tokenizer to get tokens from the rules in the synonyms.txt
file.
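
To illustrate the effect, here is a minimal Python sketch of what tokenizing a rule
amounts to. It is not the patch's code: the function names bigrams and expand_rule are
hypothetical, and the tokenizer is a simplified character bi-gram splitter standing in
for CJKTokenizer (Latin letters are used as placeholders for CJK characters).

```python
def bigrams(text):
    """Split text into overlapping character bi-grams.

    Simplified stand-in for a bi-gram tokenizer; this is an assumption
    for illustration, not Solr's CJKTokenizer.
    """
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def expand_rule(rule, tokenize=bigrams):
    """Rewrite one "lhs => rhs" synonym rule so that each comma-separated
    entry is expressed as the tokens the given tokenizer would produce."""
    lhs, rhs = rule.split("=>")

    def side(text):
        entries = [entry.strip() for entry in text.split(",")]
        return ", ".join(" ".join(tokenize(entry)) for entry in entries if entry)

    return f"{side(lhs)} => {side(rhs)}"


# The rule the user writes, and the form the filter needs internally.
print(expand_rule("ABC=>DEF"))  # AB BC => DE EF
```

With the patch, the user keeps the short form in synonyms.txt and the configured
tokenizer does this expansion when the rules are loaded.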

sample-1: CJKTokenizer

    <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
                ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.CJKTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

sample-2: NGramTokenizer

    <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
                ignoreCase="true" expand="true"
                tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldtype>

backward compatibility:
Yes. If you omit the tokenFactory attribute from the <filter class="solr.SynonymFilterFactory"/>
tag, it works as before.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

