lucene-java-user mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: How to support stemming and case folding for english content mixed with non-english content?
Date Thu, 04 Jun 2009 12:52:33 GMT
Uwe, what KK needs here is 'proper Unicode handling'.

Since the latest WordDelimiterFilter has pretty good handling of Unicode
categories, combining it with WhitespaceTokenizer effectively gives you a
pretty good solution for Unicode tokenization.

KK doesn't need to detect anything: the Porter stem filter will simply
leave the Indic text alone, so it will just work.
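
To make the idea concrete, here is a rough sketch in plain Java (no Lucene or
Solr classes) of what category-aware tokenization plus English-only suffix
stripping does to mixed text. Everything in it is an assumption for
illustration: the class name MixedScriptTokens and its methods are made up,
the splitting rule is far simpler than what WordDelimiterFilter does, and
toyStem is a hypothetical stand-in that strips a few Latin suffixes, not the
Porter algorithm. The point is only that combining marks must count as word
characters, and that Latin suffix rules never match Indic tokens.

```java
import java.util.ArrayList;
import java.util.List;

final class MixedScriptTokens {

    // Letters and digits, plus combining marks. Indic vowel signs and the
    // virama are marks, not letters, so they must count as word characters.
    public static boolean isWordPart(int cp) {
        int type = Character.getType(cp);
        return Character.isLetterOrDigit(cp)
                || type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }

    // Splits on anything that is not a word character (whitespace,
    // punctuation, symbols) and case-folds each code point. Scripts
    // without case, such as Devanagari, pass through unchanged.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (isWordPart(cp)) {
                current.appendCodePoint(Character.toLowerCase(cp));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Hypothetical stand-in for a stemmer: strips a few Latin suffixes.
    // Non-Latin tokens never end in these suffixes, so they are untouched.
    public static String toyStem(String token) {
        for (String suffix : new String[] {"ing", "ed", "s"}) {
            if (token.length() > suffix.length() + 2 && token.endsWith(suffix)) {
                return token.substring(0, token.length() - suffix.length());
            }
        }
        return token;
    }

    // Tokenize, then stem every token; no language detection anywhere.
    public static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : tokenize(text)) {
            out.add(toyStem(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // The escapes spell two Hindi words in Devanagari.
        String text = "Searching Hindi-Documents: "
                + "\u0939\u093f\u0928\u094d\u0926\u0940 \u0916\u094b\u091c";
        System.out.println(analyze(text));
    }
}
```

For the sample string in main, analyze should return
[search, hindi, document, हिन्दी, खोज]: the English tokens are folded and
stemmed, and the Hindi tokens come through intact.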

On Thu, Jun 4, 2009 at 8:40 AM, Uwe Schindler <uwe@thetaphi.de> wrote:

> > I request Uwe to give me some more ideas on using the analyzers from Solr
> > that will do the job for me, handling a mix of both English and
> > non-English content.
>
> Look here:
>
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-summary.html
>
> As you see, the Solr analyzers are just standard Lucene analyzers. So you
> can drop the solr core jar into your project and just use them :-)
>
> Currently I am not sure which analyzer Robert means, the one that can do
> English stemming and detect non-English parts, but that is the place to
> look for it.
>
> Uwe


-- 
Robert Muir
rcmuir@gmail.com
