lucene-java-user mailing list archives

From "Diego Cassinera" <diego.cassin...@mercadolibre.com>
Subject RE: Indexing accented characters, then searching by any form
Date Tue, 25 Nov 2008 16:26:06 GMT
Are you sure you are creating the fields with Field.Index.ANALYZED?

-----Original Message-----
From: Dora [mailto:julien.barret@gmail.com]
Sent: Tuesday, November 25, 2008 12:22 PM
To: java-user@lucene.apache.org
Subject: Re: Indexing accented characters, then searching by any form




Karl Wettin wrote:
> 
> Try this (dry coded) snippet instead:
> 
> StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
>    public TokenStream tokenStream(String fieldName, Reader reader) {
>      return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
>    }
> };
> 

I tried this, but it does not work as expected.

I am using a utility class with a static method that returns an analyzer:

public static Analyzer getAnalyzer()
{
	// Wrap StandardAnalyzer's token stream in an accent-stripping filter.
	StandardAnalyzer objAnalyzer = new StandardAnalyzer() {
		public TokenStream tokenStream(String fieldName, Reader reader) {
			return new ISOLatin1AccentFilter(super.tokenStream(fieldName, reader));
		}
	};
	return objAnalyzer;
}

So when I need the analyzer (for indexing or searching) I call
UtilityClass.getAnalyzer().

It works for my query parser: accents are correctly removed when
performing the search.
If my index contains "cafe", searching for "café" will find the documents
containing "cafe".

But when I explore my index with Luke, I can see that the indexer does not use
the ISOLatin1AccentFilter (I tested with a breakpoint in the overridden
tokenStream method), and if the document contains "café", the index will
contain "café".

As a consequence, searching for a word with an accent is not possible: the
index contains the accent, while the search process removes it.

So my index contains "café", but when I search for "café" the filter changes
it to "cafe" and I get no hits...

Any clue why my filter is not used at indexing time?
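
The mismatch described above can be reproduced without Lucene. The following is only a minimal sketch under stated assumptions: it uses java.text.Normalizer as a stand-in for ISOLatin1AccentFilter, a plain Set of terms as a toy "index", and hypothetical names (AccentFolding, fold, index, search) that appear nowhere in the thread. It shows that a folded query misses terms indexed without folding, and matches once both sides fold.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model of the problem; none of this is Lucene API.
final class AccentFolding {

    // Stand-in for an accent filter: decompose characters (NFD),
    // then drop the combining marks and lowercase the result.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Builds a toy index: the set of whitespace-separated terms,
    // folded only when foldAtIndexTime is true.
    static Set<String> index(String text, boolean foldAtIndexTime) {
        Set<String> terms = new HashSet<>();
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            terms.add(foldAtIndexTime ? fold(token) : token.toLowerCase(Locale.ROOT));
        }
        return terms;
    }

    // The query side always folds, as the query parser does in the message.
    static boolean search(Set<String> index, String query) {
        return index.contains(fold(query));
    }

    public static void main(String[] args) {
        String doc = "un caf\u00e9 noir"; // "un café noir"
        Set<String> unfolded = index(doc, false); // accent kept in the index
        Set<String> folded = index(doc, true);    // accent removed in the index

        System.out.println(search(unfolded, "caf\u00e9")); // false: index has "café", query became "cafe"
        System.out.println(search(folded, "caf\u00e9"));   // true
        System.out.println(search(folded, "cafe"));        // true
    }
}
```

In this toy model the fix is simply to apply the same folding on both paths. Whether the real indexer goes through the overridden tokenStream method is the open question in this thread.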




-- 
View this message in context: http://www.nabble.com/Indexing-accented-characters%2C-then-searching-by-any-form-tp15412778p20682548.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


