lucene-dev mailing list archives

From "Robert Muir (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
Date Sat, 13 Jun 2009 03:17:07 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719069#action_12719069 ]

Robert Muir commented on LUCENE-1581:
-------------------------------------

For reference, I think the concept of LowerCaseFilter, with or without a Locale, is incorrect
for Lucene when the intent is really to erase case differences.

There is an important distinction between converting to lowercase (for presentation), and
erasing case differences (for matching and searching).

Here is an example from the Unicode Standard:
Characters may also have different case mappings, depending on the context. For example,
U+03A3 "Σ" greek capital letter sigma lowercases to U+03C3 "σ" greek small letter
sigma if it is followed by another letter, but lowercases to U+03C2 "ς" greek small
letter final sigma if it is not.
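
As a rough illustration of the sigma example (not part of the original comment), the sketch below contrasts lowercasing one char at a time, which is what a filter working on a term buffer does, with lowercasing the whole string. The class and method names are made up for this sketch; the only library calls are `Character.toLowerCase` and `String.toLowerCase(Locale)`.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Lucene.
final class SigmaLowercaseDemo {

    // Lowercases one char at a time, with no knowledge of neighbouring characters.
    static String lowerPerChar(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            sb.append(Character.toLowerCase(s.charAt(i)));
        }
        return sb.toString();
    }

    // Lowercases the whole string, so context-sensitive mappings can apply.
    static String lowerWholeString(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Renders a string as code points so the result does not depend on console encoding.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("U+%04X ", (int) s.charAt(i)));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Greek capital letters omicron, delta, omicron, sigma (sigma is word-final).
        String word = "\u039F\u0394\u039F\u03A3";
        System.out.println("per char:     " + hex(lowerPerChar(word)));
        System.out.println("whole string: " + hex(lowerWholeString(word)));
    }
}
```

The two results differ only in the last character (U+03C3 versus U+03C2), so the same word can end up as two different index terms depending on how it was lowercased.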

The only correct methods to erase case differences are:
1) Localized (for a specific language): use a collator as recommended here.
2) Multilingual (for a mix of languages): use either the UCA (a collator with the ROOT locale) or
Unicode case folding, either of which is only an approximation of the language-specific rules
involved.
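
A minimal sketch of the two multilingual options, under some loud assumptions: `java.text.Collator` with `Locale.ROOT` stands in for a UCA collator (the JDK's rules are its own, and ICU4J would likely be closer to the UCA), and the `simpleFold` helper only approximates Unicode case folding by mapping each code point to uppercase and back to lowercase. The class and method names are invented for illustration.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; not part of Lucene.
final class CaseEraseDemo {

    // Compares two strings with a ROOT-locale collator at SECONDARY strength,
    // which treats case differences as insignificant.
    static boolean sameIgnoringCase(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.SECONDARY);
        return collator.compare(a, b) == 0;
    }

    // Approximation of simple case folding: upper-then-lower per code point.
    // This is NOT full Unicode case folding (for example, it never expands
    // one character into several).
    static String simpleFold(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
                sb.appendCodePoint(Character.toLowerCase(Character.toUpperCase(cp))));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sameIgnoringCase("Lucene", "LUCENE"));
        // Final sigma (U+03C2) and ordinary small sigma (U+03C3) fold to the same key.
        System.out.println(simpleFold("\u03C2").equals(simpleFold("\u03C3")));
    }
}
```

For indexing, the collator variant would presumably store `CollationKey` bytes rather than compare strings pairwise; that part is left out here.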

thanks!


> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>         Attachments: TestTurkishCollation.java
>
>
> //Since I am a .NET programmer, the sample code will be in C#, but I don't think it will be a problem to understand.
> //
> Assume an input text like "İ" and an analyzer like the one below:
> {code}
> 	public class SomeAnalyzer : Analyzer
>     	{
> 		public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
> 	        {
>             		TokenStream t = new SomeTokenizer(reader);
> 		        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
> 			t = new LowerCaseFilter(t);
> 		        return t;
> 		}
>         
>     	}
> {code}
> 	
> ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
> 	"i" (if the locale is "en-US")
> 	or
> 	"ı" (if the locale is "tr-TR"), which means this token would have to be passed to another instance of ASCIIFoldingFilter.
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, but a better approach might be to add
> a new constructor to LowerCaseFilter that forces it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         // The parameter is named "input" because "in" is a reserved word in C#.
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
> 		
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY
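
For readers on the Java side, the locale dependence described in the issue above can be reproduced with `String.toLowerCase(Locale)`. This is only a sketch: the class and method names are hypothetical, and `Locale.forLanguageTag("tr-TR")` is assumed to be an adequate stand-in for the .NET "tr-TR" culture.

```java
import java.util.Locale;

// Hypothetical demo class; not part of Lucene.
final class TurkishLowercaseDemo {

    // Lowercases a term under an explicit locale instead of the JVM default.
    static String lower(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Print code points rather than glyphs to avoid console encoding issues.
        System.out.printf("I under en:    U+%04X%n", (int) lower("I", Locale.ENGLISH).charAt(0));
        System.out.printf("I under tr-TR: U+%04X%n", (int) lower("I", turkish).charAt(0));
    }
}
```

Under the Turkish locale, "I" becomes dotless "ı" (U+0131) rather than "i", which is the mismatch the issue describes for a token that has already been folded to ASCII.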

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

