lucene-dev mailing list archives

From "Otis Gospodnetic (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SOLR-319) changes SynonymFilterFactory to "Analyze" synonyms file
Date Tue, 26 Apr 2011 18:23:03 GMT

     [ https://issues.apache.org/jira/browse/SOLR-319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Otis Gospodnetic updated SOLR-319:
----------------------------------

    Summary: changes SynonymFilterFactory to "Analyze" synonyms file  (was: changes SynonymFilterFactoryto
"Analyze" synonyms file)

> changes SynonymFilterFactory to "Analyze" synonyms file
> -------------------------------------------------------
>
>                 Key: SOLR-319
>                 URL: https://issues.apache.org/jira/browse/SOLR-319
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 1.3
>
>         Attachments: SOLR-319.patch, SOLR-319.patch, SOLR-319.patch
>
>
> WHAT:
> Currently, SynonymFilterFactory works well with N-gram tokenizers (CJKTokenizer, for example),
> but the rules in synonyms.txt have to be written with care.
> For example, if I use CJKTokenizer (which works as a bi-gram tokenizer for CJK characters)
> and want C1C2C3 to map to C4C5C6, I have to write the rule as follows:
> C1C2 C2C3 => C4C5 C5C6
> But I want to write it as "C1C2C3=>C4C5C6". This patch allows that. It also makes
> synonyms.txt easier to share.
> HOW:
> A tokenFactory attribute is added to <filter class="solr.SynonymFilterFactory"/>.
> If the attribute is specified, SynonymFilterFactory uses that TokenizerFactory to create
> a Tokenizer, and then uses the Tokenizer to get tokens from the rules in the
> synonyms.txt file.
> sample-1: CJKTokenizer
>     <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ja.txt"
>                 ignoreCase="true" expand="true" tokenFactory="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.CJKTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> sample-2: NGramTokenizer
>     <fieldtype name="text_ngram" class="solr.TextField" positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.SynonymFilterFactory" synonyms="ngram_synonym_test_ngram.txt"
>                 ignoreCase="true" expand="true"
>                 tokenFactory="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldtype>
> backward compatibility:
> Yes. If you omit the tokenFactory attribute from the <filter class="solr.SynonymFilterFactory"/>
> tag, it works as before.
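
To illustrate the rewriting described under WHAT and HOW above, here is a minimal Java sketch. It is not the SOLR-319 patch and does not use Solr or Lucene classes: the class name SynonymRuleAnalyzer, the helper bigrams, and the method analyzeRule are hypothetical, and the plain character bi-gram function is only a simplified stand-in for what CJKTokenizer does on CJK text. It assumes a rule of the form "lhs=>rhs" and runs a pluggable tokenizer over each side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration only; not the SOLR-319 patch and not Solr's API.
final class SynonymRuleAnalyzer {

    // Simplified stand-in for a bi-gram tokenizer: "abc" -> ["ab", "bc"].
    // Non-empty input shorter than two characters is returned as a single token.
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() < 2) {
            if (!text.isEmpty()) {
                tokens.add(text);
            }
            return tokens;
        }
        for (int i = 0; i + 2 <= text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // Tokenizes both sides of an "lhs=>rhs" rule with the given tokenizer
    // and joins the resulting tokens with single spaces.
    static String analyzeRule(String rule, Function<String, List<String>> tokenizer) {
        String[] sides = rule.split("=>", 2);
        if (sides.length != 2) {
            throw new IllegalArgumentException("expected lhs=>rhs: " + rule);
        }
        String lhs = String.join(" ", tokenizer.apply(sides[0].trim()));
        String rhs = String.join(" ", tokenizer.apply(sides[1].trim()));
        return lhs + " => " + rhs;
    }

    public static void main(String[] args) {
        // Prints the rule as it would have to be hand-written for a bi-gram tokenizer.
        System.out.println(analyzeRule("abc=>def", SynonymRuleAnalyzer::bigrams));
    }
}
```

With the three-character strings standing in for C1C2C3 and C4C5C6, the compact rule is expanded into the two-token form that otherwise has to be written by hand in synonyms.txt.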

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

