lucene-dev mailing list archives

From "DM Smith (JIRA)" <j...@apache.org>
Subject [jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.
Date Sun, 29 Mar 2009 11:49:51 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693545#action_12693545
] 

DM Smith commented on LUCENE-1581:
----------------------------------

This is a somewhat larger problem. It also pertains to upper casing.

I don't remember exactly, but I seem to recall that Java lags behind the Unicode spec and
in Locale support (e.g. it does not include fa, Farsi). I find that ICU4J keeps current
with the spec.

I don't remember which way it goes, maybe it's both, but some Locales don't have a corresponding
upper or lower case for some characters.

I'm not sure, but I think efficiency depends on how the text is normalized in Unicode (e.g. NFC,
NFKC, NFD, or NFKD). These forms might produce different performance results.

(It is a different issue, but it is critical that search requests undergo the same Unicode
normalization as the index. As Lucene does not have these normalization filters, I find that I
have to do this externally in my own filters using ICU.)
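
To illustrate the normalization point, here is a minimal sketch. It assumes the JDK's
java.text.Normalizer (Java 6 and later) rather than the ICU-based filters mentioned above,
and the class and method names are hypothetical. The same normalize() call would have to be
applied both at index time and at query time:

```java
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene.
class NormalizationSketch {
    // Bring text to NFC so that indexed terms and query terms compare equal.
    static String normalize(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // e-acute as a single code point (U+00E9)
        String decomposed = "e\u0301"; // "e" followed by a combining acute accent (U+0301)

        // Without normalization the two spellings are different strings.
        System.out.println(composed.equals(decomposed)); // false
        // After normalizing both sides to the same form they match.
        System.out.println(normalize(composed).equals(normalize(decomposed))); // true
    }
}
```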

(Again a different issue: Another related kind of folding is that of base 10 number shaping.)

Regarding: 
bq. I see no easy way (and efficient) to fix it. Suppose that we allow LowerCaseFilter to
accept Locale. What would it do with it?

I think that we need Upper and Lower case filters that operate on the token as a whole, using
the string-level method to do case conversion.
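
A rough sketch of why the string-level method matters, using only the JDK (the class and
helper names here are hypothetical). A per-char conversion can never change the length of the
term and ignores the locale, while String.toUpperCase/toLowerCase can do both:

```java
import java.util.Locale;

// Hypothetical demo class, not part of Lucene.
class CaseMappingSketch {
    // Per-char upper casing, the way a char[]-based filter would do it.
    static String upperPerChar(String s) {
        char[] buf = s.toCharArray();
        for (int i = 0; i < buf.length; i++) {
            buf[i] = Character.toUpperCase(buf[i]);
        }
        return new String(buf);
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe"; // "strasse" spelled with sharp s (U+00DF)

        // Per-char: sharp s has no single-char upper case, so it stays as it is.
        System.out.println(upperPerChar(word).length()); // 6
        // String-level: sharp s expands to "SS", so the term grows by one char.
        System.out.println(word.toUpperCase(Locale.GERMAN)); // STRASSE

        // Locale-sensitive lower casing of plain "I".
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("I".toLowerCase(Locale.ENGLISH));                  // i
        System.out.println("I".toLowerCase(turkish).equals("\u0131"));       // true (dotless i, U+0131)
    }
}
```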

What I'd like to see is a pluggable way for Lucene to use ICU for Locale-specific things
such as this. For example, a base class UpperCaseFolder could provide the Java implementation
but accept an alternate implementation, perhaps loaded by reflection.
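
Something along these lines is what I have in mind. This is only a sketch under my own
assumptions: the name CaseFolder, its methods, and the load() helper are all hypothetical
and nothing like them exists in Lucene today. The base class uses the JDK, and an alternate
implementation (e.g. an ICU4J-backed subclass shipped separately) is looked up by class name:

```java
import java.util.Locale;

// Hypothetical base class: provides the plain JDK implementation.
class CaseFolder {
    public String toUpperCase(String term, Locale locale) {
        return term.toUpperCase(locale);
    }

    public String toLowerCase(String term, Locale locale) {
        return term.toLowerCase(locale);
    }

    // Try to load an alternate implementation by reflection.
    // Falls back to the JDK implementation if the class is missing or unusable.
    static CaseFolder load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (CaseFolder) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException | LinkageError e) {
            return new CaseFolder();
        }
    }

    public static void main(String[] args) {
        // "org.example.IcuCaseFolder" is a made-up class name; when it is not on
        // the classpath, the JDK-based default is used instead.
        CaseFolder folder = CaseFolder.load("org.example.IcuCaseFolder");
        System.out.println(folder.toLowerCase("TITLE", Locale.ENGLISH)); // title
    }
}
```

A filter would then take a CaseFolder and a Locale in its constructor, so Lucene itself would
not need a compile-time dependency on ICU.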




> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .NET programmer, the sample code is in C#, but I don't think that will make it hard to understand.
> //
> Assume an input text like "İ" and an analyzer like the one below
> {code}
> public class SomeAnalyzer : Analyzer
> {
>     public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
>     {
>         TokenStream t = new SomeTokenizer(reader);
>         t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>         t = new LowerCaseFilter(t);
>         return t;
>     }
> }
> {code}
> 	
> ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
> 	"i" (if the locale is "en-US")
> 	or
> 	"ı" (if the locale is "tr-TR"), which means this token would have to be passed to another
> instance of ASCIIFoldingFilter.
> So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, but a better
> approach would be to add a new constructor to LowerCaseFilter that forces it to use a
> specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;
>         public LowerCaseFilter(TokenStream input) : base(input)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
> 		
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

