lucene-java-user mailing list archives

From Steve Rowe <sar...@gmail.com>
Subject Re: Is StandardAnalyzer good enough for multi languages...
Date Wed, 09 Jan 2013 06:25:44 GMT
Dude.  Go look.  It allows for per-script specialization, with (non-UAX#29) specializations
by default for Thai, Lao, Myanmar and Hebrew.  See DefaultICUTokenizerConfig.  It's filled
with exactly the opposite of what you were describing. 
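
To make "per-script specialization" concrete, here is a rough sketch of the idea using only the JDK: cut the text into script runs, then pick break rules per run. This is not the icu module's code. ScriptRuns, Run, breakRulesFor and the two rule labels are made-up names for illustration, and the only script list it knows is the one named above.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

final class ScriptRuns {

    /** One maximal stretch of text that belongs to a single script. */
    static final class Run {
        final UnicodeScript script;
        final String text;

        Run(UnicodeScript script, String text) {
            this.script = script;
            this.text = text;
        }
    }

    /** COMMON and INHERITED code points (spaces, digits, marks) have no script of their own. */
    private static boolean isNeutral(UnicodeScript s) {
        return s == UnicodeScript.COMMON || s == UnicodeScript.INHERITED;
    }

    /** Splits text into script runs; neutral characters stay attached to the run they follow. */
    static List<Run> split(String text) {
        List<Run> runs = new ArrayList<>();
        UnicodeScript current = null; // null until the first non-neutral code point
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript s = UnicodeScript.of(cp);
            if (!isNeutral(s)) {
                if (current == null) {
                    current = s;
                } else if (s != current) {
                    // Script changed: close the previous run and start a new one here.
                    runs.add(new Run(current, text.substring(start, i)));
                    current = s;
                    start = i;
                }
            }
            i += Character.charCount(cp);
        }
        if (start < text.length()) {
            runs.add(new Run(current == null ? UnicodeScript.COMMON : current,
                             text.substring(start)));
        }
        return runs;
    }

    /** Hypothetical dispatch: scripts named in this thread get their own rules. */
    static String breakRulesFor(UnicodeScript script) {
        switch (script) {
            case THAI:
            case LAO:
            case MYANMAR:
            case HEBREW:
                return "script-specific";
            default:
                return "UAX#29 default";
        }
    }

    public static void main(String[] args) {
        // Latin, Thai and Hebrew words, written as escapes to avoid source-encoding issues.
        String sample = "Lucene \u0E44\u0E17\u0E22 \u05E2\u05D1\u05E8\u05D9\u05EA";
        for (Run r : split(sample)) {
            System.out.println(r.script + " -> " + breakRulesFor(r.script));
        }
    }
}
```

The point of the design is that a word-break rule set only ever sees text in one script, so swapping the rules for one script leaves every other script untouched.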

ICUTokenizerFactory's customizability has been enhanced in to-be-released Lucene/Solr 4.1:
<https://issues.apache.org/jira/browse/SOLR-4123> - you can provide per-script RuleBasedBreakIterator
specification files at runtime. 
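
As I read SOLR-4123, the factory takes a rulefiles attribute: a comma-separated list of code:path pairs, where code is a four-letter ISO 15924 script code and path is a resource containing RuleBasedBreakIterator rules. The sketch below only shows how such a string maps scripts to files. RuleFileSpec and the file names are invented for illustration, and this is not the factory's own parsing code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RuleFileSpec {

    /** Parses "Latn:a.rbbi,Cyrl:b.rbbi" into an ordered script-code to resource-path map. */
    static Map<String, String> parse(String spec) {
        Map<String, String> byScript = new LinkedHashMap<>();
        for (String pair : spec.split(",")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate stray commas
            }
            // The script code is exactly four characters, so the first colon must sit at index 4.
            int colon = trimmed.indexOf(':');
            String path = colon == 4 ? trimmed.substring(5).trim() : "";
            if (path.isEmpty()) {
                throw new IllegalArgumentException(
                        "expected <4-letter script code>:<path>, got: " + trimmed);
            }
            byScript.put(trimmed.substring(0, 4), path);
        }
        return byScript;
    }

    public static void main(String[] args) {
        // Hypothetical file names; any resource path would do.
        Map<String, String> rules = parse("Latn:my.Latin.rules.rbbi, Cyrl:my.Cyrillic.rules.rbbi");
        for (Map.Entry<String, String> e : rules.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Scripts that are not listed would keep whatever rules the tokenizer config uses by default.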

On Jan 9, 2013, at 12:12 AM, Trejkaz <trejkaz@trypticon.org> wrote:

> On Wed, Jan 9, 2013 at 10:57 AM, Steve Rowe <sarowe@gmail.com> wrote:
>> Trejkaz (and maybe Sai too): ICUTokenizer in Lucene's icu module may be of interest
>> to you, along with the token filters in that same module. - Steve
> 
> ICUTokenizer sounds like it's implementing UAX #29, which is exactly
> the standard filled with all the issues I was describing. Unless it
> does more than that, I would recommend against using that also.
> 



