lucene-java-user mailing list archives

From Paul Hill <p...@metajure.com>
Subject RE: Is StandardAnalyzer good enough for multi languages...
Date Wed, 09 Jan 2013 18:27:03 GMT
There is often the possibility of putting another tokenizer in the chain to create a variant
analyzer.  This is NOT very hard at all in either Lucene or ElasticSearch.
Extra tokenizers can tweak the overall processing in two ways: one added late in the chain can
overcome an overlooked tokenization (breaking on colon would be a simple example), and one added
before the others can change a token that seems incorrectly processed into one that comes out
how you like.
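
For illustration, here is a minimal sketch of the "before" tweak against the Lucene 4.x API.
The class name and the colon example are mine; in Lucene terms the step ahead of the Tokenizer
is a CharFilter, and the "late" steps are TokenFilters chained after it:

import java.io.Reader;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.pattern.PatternReplaceCharFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Variant analyzer: StandardTokenizer plus an extra pre-tokenization step
// that rewrites ':' to a space, so a colon always acts as a token break
// (the standard rules can treat letter:letter as a single word).
public class ColonSplittingAnalyzer extends Analyzer {
  private static final Pattern COLON = Pattern.compile(":");

  @Override
  protected Reader initReader(String fieldName, Reader reader) {
    // The "before" tweak: runs ahead of the tokenizer for every field.
    return new PatternReplaceCharFilter(COLON, " ", reader);
  }

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new StandardTokenizer(Version.LUCENE_40, reader);
    // The "late" tweaks: TokenFilters chained after the tokenizer.
    TokenStream sink = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, sink);
  }
}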

Trejkaz, I haven't tried to use ICU yet, but from what I understand, I think you'll find that ICU
is more in agreement with your views: it embraces the idea of refining the tokenization etc.
as needed, not relying on the curious (and often flawed) choices of some design committee somewhere.
 

[ICU]
> -----Original Message-----
> ... no specialisation for straight Roman script, but I guess it could
> always be added.

That would be one of the main points of the whole ICU infrastructure.
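
For what it's worth, the hook for that in Lucene's analyzers-icu module is ICUTokenizerConfig:
ICUTokenizer asks it for a BreakIterator per script, so Latin can get its own rules while every
other script keeps the defaults.  A sketch, assuming the Lucene 4.x API; the rule string is only
illustrative (see the ICU RBBI docs for the real syntax), not production rules:

import java.io.Reader;

import com.ibm.icu.lang.UScript;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;

// Per-script specialization: Latin-script runs break only on whitespace,
// while every other script keeps the default ICU behavior.
public class LatinSpecializedAnalyzer extends Analyzer {

  // Illustrative RBBI rules: whitespace is a plain break (status 0, no
  // token); any other run becomes one token with status 200 (<ALPHANUM>).
  private static final BreakIterator LATIN_BREAKER = new RuleBasedBreakIterator(
      "$Whitespace = [\\p{Whitespace}];\n"
      + "$NonWhitespace = [\\P{Whitespace}];\n"
      + "$Whitespace;\n"
      + "$NonWhitespace+ {200};");

  private static final DefaultICUTokenizerConfig CONFIG = new DefaultICUTokenizerConfig() {
    @Override
    public BreakIterator getBreakIterator(int script) {
      if (script == UScript.LATIN) {
        // Hand out a clone so concurrent tokenizers don't share iterator state.
        return (BreakIterator) LATIN_BREAKER.clone();
      }
      return super.getBreakIterator(script); // defaults for everything else
    }
  };

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new ICUTokenizer(reader, CONFIG);
    return new TokenStreamComponents(source);
  }
}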

-Paul 

