lucene-dev mailing list archives

From Nick Menere <nick.men...@atlassian.com>
Subject Russian Tokenizing
Date Tue, 19 Jun 2007 05:04:03 GMT
Hi guys,

Nick from Atlassian here.  We had a customer complain that they could 
not search on numbers when using Russian as their indexing language.

I tracked this down to the RussianLetterTokenizer.
It extends CharTokenizer and splits on any character that is neither a 
letter (per Character.isLetter()) nor included in a char array passed to 
the constructor.  In effect, it drops numbers from the token stream.

We were passing in the RussianCharsets.UnicodeRussian charset to the 
constructor.
I can get around this issue by adding the chars 0-9 to the passed-in 
char set.
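
To illustrate the behaviour described above, here is a minimal sketch in plain Java. It is not the Lucene source: the class LetterTokenizerSketch and its methods isTokenChar and tokenize are hypothetical names, written only from the description in this message (a character is kept if Character.isLetter() accepts it or if it appears in the char array given to the tokenizer). It has no Lucene dependency.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the behaviour described in this message;
// this is NOT the actual RussianLetterTokenizer implementation.
public class LetterTokenizerSketch {

    // A char belongs to a token if it is a letter or is listed in extraChars.
    static boolean isTokenChar(char c, char[] extraChars) {
        if (Character.isLetter(c)) {
            return true;
        }
        for (char extra : extraChars) {
            if (c == extra) {
                return true;
            }
        }
        return false;
    }

    // Split text into maximal runs of token chars; everything else is a separator.
    static List<String> tokenize(String text, char[] extraChars) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c, extraChars)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // A three-letter Cyrillic word (written with escapes), a number,
        // and a mixed ASCII word.
        String text = "\u0442\u043e\u043c 2007 abc123";

        // Letters only: the digits act as separators and never reach the index.
        System.out.println(tokenize(text, new char[0]));

        // The workaround: digits added to the extra char set are kept.
        System.out.println(tokenize(text, "0123456789".toCharArray()));
    }
}
```

With an empty extra set the sketch yields only the Cyrillic word and "abc"; with the digits added it also yields "2007" and keeps "abc123" whole, which matches the workaround above.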

From what I can tell, there shouldn't be any side effects to this, 
though I don't think it is the correct solution.

What I am wondering is: is there any reason why they didn't use the 
StandardTokenizer with an extended char set?  And is this something we 
should look at fixing?  Not speaking Russian, I can't tell if this is 
the correct way to do it.
They would then benefit from the greater functionality provided by the 
StandardTokenizer.

I have also noticed that some other languages go down this path, e.g. Greek.

Cheers,
Nick
