lucene-dev mailing list archives

From "DM Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
Date Sun, 29 Mar 2009 16:13:50 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693579#action_12693579 ]

DM Smith commented on LUCENE-1581:
----------------------------------

bq. Why do this?
Lucene has a bias toward English texts and does not have a fundamental architecture focused
on internationalization and localization. IMHO, it should.

Java does not implement Unicode well and does not keep abreast of its changes. It's not
that ICU is the right solution; it is *a* robust solution.

bq. What prevents you in your application from creating such a filter?
Nothing at all. But I think that proper behavior regarding Unicode and locales is something
that many want, especially for non-English indexes. As such, it belongs in Lucene, not in
individual projects.

With that in mind, I think it would be great if Lucene were fully internationalized and localized,
at least from a fundamental architecture perspective. (There is a separate issue on what core
and contrib should be. I'm not clear where "analyzers" fall wrt that.)

As an implementation: if ICU is present, it is used, with potential performance impacts; if
it is not, behavior degrades predictably and gracefully. This would create a quasi-dependency,
not a hard one.
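
The quasi-dependency idea above could be sketched roughly as follows. This is only an illustration, not existing Lucene code: the class name OptionalIcuLowerCase and its methods are hypothetical. It assumes ICU4J's com.ibm.icu.lang.UCharacter.toLowerCase(Locale, String) is looked up by reflection, so nothing needs ICU at compile time, and that the JDK's String.toLowerCase(Locale) is an acceptable degraded behavior.

```java
import java.lang.reflect.Method;
import java.util.Locale;

// Hypothetical helper (not a Lucene class): uses ICU4J when it is on the
// classpath and falls back to the JDK otherwise.
final class OptionalIcuLowerCase {
    // Resolved once at class load; null means ICU4J was not found.
    private static final Method ICU_TO_LOWER = findIcuToLower();

    private static Method findIcuToLower() {
        try {
            Class<?> uchar = Class.forName("com.ibm.icu.lang.UCharacter");
            return uchar.getMethod("toLowerCase", Locale.class, String.class);
        } catch (ReflectiveOperationException | LinkageError e) {
            return null; // ICU4J absent or unusable: degrade to the JDK
        }
    }

    static boolean icuAvailable() {
        return ICU_TO_LOWER != null;
    }

    static String toLowerCase(String text, Locale locale) {
        if (ICU_TO_LOWER != null) {
            try {
                // Static method, so the receiver argument is null.
                return (String) ICU_TO_LOWER.invoke(null, locale, text);
            } catch (ReflectiveOperationException e) {
                // Fall through to the JDK implementation below.
            }
        }
        return text.toLowerCase(locale);
    }
}
```

A caller would use OptionalIcuLowerCase.toLowerCase(term, locale) everywhere and never reference ICU types directly. The reflective call is one source of the performance impact mentioned above.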

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .NET programmer, the sample code will be in C#, but I don't think it will be a problem to understand.
> //
> Assume an input text like "İ" and an analyzer like the one below
> {code}
> 	public class SomeAnalyzer : Analyzer
>     	{
> 		public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
> 	        {
>             		TokenStream t = new SomeTokenizer(reader);
> 		        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
> 			t = new LowerCaseFilter(t);
> 		        return t;
> 		}
>         
>     	}
> {code}
> 	
> ASCIIFoldingFilter will return "I", and then LowerCaseFilter will return
> 	"i" (if the locale is "en-US")
> 	or
> 	"ı" (if the locale is "tr-TR"), which means this token would have to be passed to another instance of ASCIIFoldingFilter.
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, but a better approach would be to add
> a new constructor to LowerCaseFilter that forces it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         // "in" is a reserved word in C#, so the parameter is named "input" here
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
> 		
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY
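
For the Java side, the locale dependence described in the quoted issue can be reproduced with the JDK alone. The sketch below is an illustration under assumptions, not the proposed patch: LocaleLowerCaser is a hypothetical name, and it lowercases the term as a String rather than in place, because locale-aware lowercasing can change the length of the text.

```java
import java.util.Locale;

// Hypothetical helper (not a Lucene class): lowercases a term buffer with an
// explicitly chosen locale instead of the JVM default.
final class LocaleLowerCaser {
    private final Locale locale;

    LocaleLowerCaser(Locale locale) {
        this.locale = locale;
    }

    // Lowercases the first `length` chars of `buffer` and returns the result
    // as a new String; the input buffer is left unmodified.
    String lower(char[] buffer, int length) {
        return new String(buffer, 0, length).toLowerCase(locale);
    }
}
```

With Locale.forLanguageTag("tr-TR"), the input "I" becomes the dotless "ı" (U+0131); with Locale.ENGLISH it becomes "i". Fixing the locale in the constructor makes the index independent of the machine's default locale.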
