lucene-dev mailing list archives

From "Shai Erera (JIRA)" <>
Subject [jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
Date Sun, 29 Mar 2009 10:52:50 GMT


Shai Erera commented on LUCENE-1581:

From the javadocs:

_In general, String.toLowerCase() should be used to map characters to lowercase. String case
mapping methods have several benefits over Character case mapping methods. String case mapping
methods can perform locale-sensitive mappings, context-sensitive mappings, and 1:M character
mappings, whereas the Character case mapping methods cannot._
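
To make the quoted difference concrete, here is a minimal Java sketch (not part of the original comment). The class and method names are hypothetical and chosen only for illustration; the only library calls used are Character.toLowerCase(char), String.toLowerCase(Locale) and Locale.forLanguageTag.

```java
import java.util.Locale;

// Hypothetical demo class; the name is not taken from Lucene.
class CaseMappingDemo {
    // Per-character mapping: takes no locale and always returns exactly one char.
    static char lowerChar(char c) {
        return Character.toLowerCase(c);
    }

    // String mapping: locale-sensitive, and the result may differ in length (1:M).
    static String lowerString(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");
        // 'I' (U+0049) mapped per character vs. as a String in the Turkish locale.
        System.out.println(Integer.toHexString(lowerChar('I')));
        System.out.println(Integer.toHexString(lowerString("I", turkish).charAt(0)));
        // U+0130 (capital I with dot above) lowercased as a String in English.
        System.out.println(lowerString("\u0130", Locale.ENGLISH).length());
    }
}
```

In the Turkish locale the String method maps 'I' to \u0131, and outside it the String method expands \u0130 to two chars ('i' followed by the combining dot \u0307). Neither result can be produced by the one-char-in, one-char-out Character method.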

So I agree this is a problem, but I see no easy and efficient way to fix it. Suppose that
we allow LowerCaseFilter to accept a Locale. What would it do with it? We could add to LowerCaseFilter
a Map<Locale, char[65536]> and allow one to pass in the Locale. If it's not null, and
there's an entry in the map, look up every character the filter receives. The lookup will be
quite fast, as the character will serve as the index into the array (notice that we cover only
2-byte characters though), and if the entry is \uFFFF we can assume there's no special handling and
call Character.toLowerCase.
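
The lookup-table idea above could be sketched roughly as follows. This is an illustrative assumption, not code from Lucene: the class LocaleLowerCaseTable, its methods and the NO_MAPPING constant are all hypothetical names, and the table only holds whatever overrides the caller registers.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper; not an existing Lucene class.
class LocaleLowerCaseTable {
    // Sentinel meaning "no locale-specific mapping for this char".
    static final char NO_MAPPING = '\uFFFF';

    // One 65536-entry table per locale, indexed directly by the char value.
    private final Map<Locale, char[]> tables = new HashMap<>();

    // Registers one override, creating the locale's table on first use.
    void addMapping(Locale locale, char from, char to) {
        char[] table = tables.get(locale);
        if (table == null) {
            table = new char[65536];
            Arrays.fill(table, NO_MAPPING);
            tables.put(locale, table);
        }
        table[from] = to;
    }

    // Uses the locale's table when it has an entry, else Character.toLowerCase.
    char toLowerCase(Locale locale, char c) {
        char[] table = (locale == null) ? null : tables.get(locale);
        if (table != null && table[c] != NO_MAPPING) {
            return table[c];
        }
        return Character.toLowerCase(c);
    }
}
```

For example, after addMapping(turkish, 'I', '\u0131') a Turkish lookup of 'I' yields \u0131, while every other locale and character falls through to Character.toLowerCase. Such a table still cannot express 1:M mappings or supplementary characters.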

That is very fragile though, as it's not easy to cover all the special-case characters. Also,
every time we find a special character that was not handled properly by the filter (including
this one), fixing it would break back-compat, no?

BTW, when characters are uppercase, I don't think we have a problem, as they will always be
lowercased to a single character (even if it's the wrong one, it will be consistent in indexing
and search). The problem comes with the lowercase characters. The character \u0131 (lowercase
dotless I in Turkish) is lowercased to \u0131, while its uppercase version (I) is lowercased to 'i'.
Therefore there is a mismatch, and we'll fail if the user enters a lowercase query (as
they often do).
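
The mismatch can be reproduced with a small sketch that lowercases a term one char at a time, as a per-character filter would. The class name, method name and sample terms below are made up for illustration.

```java
// Hypothetical demo class; the name is not taken from Lucene.
class TurkishMismatchDemo {
    // Lowercases each char independently, without any locale.
    static String lowerPerChar(String term) {
        char[] buffer = term.toCharArray();
        for (int i = 0; i < buffer.length; i++) {
            buffer[i] = Character.toLowerCase(buffer[i]);
        }
        return new String(buffer);
    }

    public static void main(String[] args) {
        String indexed = lowerPerChar("ILIK");           // text indexed in uppercase
        String queried = lowerPerChar("\u0131l\u0131k"); // same word typed in lowercase
        // The two terms differ, so the lowercase query misses the indexed term.
        System.out.println(indexed.equals(queried));
    }
}
```

The uppercase form is indexed as "ilik", while the lowercase query keeps its \u0131 characters, so the two terms never match.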

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>                 Key: LUCENE-1581
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
> //Since I am a .Net programmer, sample code will be in C#, but I don't think that it would be a problem to understand it.
> //
> Assume an input text like "İ" and an analyzer like below
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
> ASCIIFoldingFilter will return "I" and after that, LowerCaseFilter will return
> 	"i" (if locale is "en-US")
> 	or
> 	"ı" (if locale is "tr-TR") (that means this token should be input to another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}

