lucene-java-user mailing list archives

From Paul Hill <p...@metajure.com>
Subject RE: Is StandardAnalyzer good enough for multi languages...
Date Tue, 08 Jan 2013 23:54:58 GMT
The ICU project ( http://site.icu-project.org/ ) has analyzers for Lucene, and they have been
ported to ElasticSearch.  Maybe those integrate better.

As to not doing some tokenization, I would think an extra tokenizer in your chain would be
just the thing.
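
A minimal sketch of that "extra step in the chain" idea, using only the JDK so it runs without
Lucene or ICU on the classpath. It is an illustration under assumptions, not the Lucene analysis
API: java.text.BreakIterator stands in for the first tokenizer, and the class name
ExtraSplitSketch, the method names, and the choice of '.' and '_' as extra split characters are
all hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: these names are illustrative, not Lucene or ICU APIs.
class ExtraSplitSketch {

    // Step 1: word segmentation with the JDK's locale-aware BreakIterator
    // (a stand-in for whatever tokenizer sits first in the real chain).
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Keep only segments that contain at least one letter or digit,
            // which drops whitespace and bare punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Step 2: the extra step, splitting on characters that the first step
    // may have left inside a token (assumed here to be '.' and '_').
    static List<String> splitFurther(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            for (String part : token.split("[._]+")) {
                if (!part.isEmpty()) {
                    out.add(part);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> first = tokenize("open file_name.txt now", Locale.ENGLISH);
        System.out.println(splitFurther(first));
    }
}
```

In a real Lucene chain the second step would more likely be a token filter placed after the
tokenizer; the two-method split here only shows the shape of the idea.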

-Paul

> -----Original Message-----
> From: Trejkaz [mailto:trejkaz@trypticon.org]
> Sent: Tuesday, January 08, 2013 3:44 PM
> To: java-user@lucene.apache.org
> Subject: Re: Is StandardAnalyzer good enough for multi languages...
> 
> On Wed, Jan 9, 2013 at 6:30 AM, saisantoshi <saisantoshi76@gmail.com> wrote:
> > Does Lucene StandardAnalyzer work for all the languages for tokenizing
> > before indexing (since we are using java, I think the content is
> > converted to UTF-8 before tokenizing/indexing)?
> 
> No. There are multiple cases where it chooses not to break something which it should
> break. Some of these cases even result in undesirable behaviour for English, so I would
> be surprised if there were even a single language which it handles acceptably.
