lucene-solr-user mailing list archives

From jfmel...@free.fr
Subject Re: Iso accents and wildcards
Date Fri, 30 Oct 2009 16:11:54 GMT
If the request contains any wildcard, the analysis filters are not applied to it:
neither ISOLatin1AccentFilterFactory nor SnowballPorterFilterFactory is called!

"économie" is indexed to "econom"

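So Solr finds nothing for:
 - a term starting with "éco"     (éco*): the accent is not removed from the query
 - a term starting with "economi" (economi*): the prefix is longer than the indexed stem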
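
One possible workaround is to normalize wildcard terms on the client side before sending them to Solr. The sketch below is only an illustration: fold_wildcard_term is a hypothetical helper, not part of Solr, and it only approximates ISOLatin1AccentFilterFactory by lowercasing and dropping combining accents. It deliberately does no stemming.

```python
import unicodedata


def fold_wildcard_term(term: str) -> str:
    """Lowercase a wildcard term and strip its accents, leaving * and ? intact."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term.lower())
    # Drop the combining marks (Unicode category Mn) and keep everything else.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(fold_wildcard_term("éco*"))    # eco*
print(fold_wildcard_term("Mangé*"))  # mange*
```

With this, éco* becomes eco*, which can match the indexed term "econom". It does not help with economi*, because that prefix is still longer than the stem stored in the index.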

If you index manger, mangé and mangue, the indexed terms will be mang and mangu.

requests  ->  results

manger   ->   manger, mangé
mangé    ->   manger, mangé
mang     ->   manger, mangé
mangu    ->   mangue
mang*    ->   manger, mangé, mangue
mang?    ->   mangue  (and not mangé)
mangé*   ->   nothing
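
The three wildcard rows can be reproduced with a toy model in which the pattern is compared, unanalyzed, against the indexed terms. This is only a sketch: the INDEXED mapping is copied from the terms listed above, wildcard_search is a hypothetical name, and Python's fnmatch stands in for Lucene's wildcard matching.

```python
from fnmatch import fnmatchcase

# Indexed terms as reported in this message (original word -> term in the index).
INDEXED = {"manger": "mang", "mangé": "mang", "mangue": "mangu"}


def wildcard_search(pattern: str) -> list[str]:
    """Return the words whose indexed term matches the raw, unanalyzed pattern."""
    return [word for word, term in INDEXED.items() if fnmatchcase(term, pattern)]


print(wildcard_search("mang*"))   # ['manger', 'mangé', 'mangue']
print(wildcard_search("mang?"))   # ['mangue']
print(wildcard_search("mangé*"))  # []
```

The pattern mangé* returns nothing because no indexed term contains "é": the accent was removed at index time but is left in the wildcard query.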

Jean-François


----- "Nicolas Leconte" <nicolas.aidel@aidel.com> wrote:

| Hi all,
| 
| I have a field that contains accented characters, and I want to be able
| to search it while ignoring accents.
| I have set up that field with:
| <analyzer>
|   <tokenizer class="solr.StandardTokenizerFactory"/>
|   <filter class="solr.StandardFilterFactory"/>
|   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
|           generateNumberParts="1" catenateWords="1" catenateNumbers="1"
|           catenateAll="0" splitOnCaseChange="1"/>
|   <filter class="solr.LowerCaseFilterFactory"/>
|   <filter class="solr.StopFilterFactory" ignoreCase="true"
|           words="stopwords.txt"/>
|   <filter class="solr.SnowballPorterFilterFactory" language="French"/>
|   <filter class="solr.LowerCaseFilterFactory"/>
|   <filter class="solr.ISOLatin1AccentFilterFactory"/>
|   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
| </analyzer>
| 
| In the index the word "économie" becomes "econom": the accent is removed
| by the ISOLatin1AccentFilterFactory and the end of the word is removed by
| the SnowballPorterFilterFactory.
| 
| When I query with title:econ* I get the correct answers, but if I query
| with title:écon* I get no answers.
| If I query with title:économ (the exact word of the index) it works,
| so there might be something wrong with the wildcard.
| As far as I understand, the analyzer should be used in exactly the same
| way at both index and query time.
| 
| I have tried changing the order of the filters (putting the
| ISOLatin1AccentFilterFactory on top) without any result.
|
| Could anybody help me with that and point out what may be wrong with my
| schema?
