lucene-java-user mailing list archives

From Shai Erera <ser...@gmail.com>
Subject Re: Language Detection for Analysis?
Date Thu, 06 Aug 2009 20:05:52 GMT
Robert - can you elaborate on what you mean by "just treat it at the script
level"?

On Thu, Aug 6, 2009 at 10:55 PM, Robert Muir <rcmuir@gmail.com> wrote:

> Bradford, there is an Arabic analyzer in trunk. For Farsi there is
> currently a patch available:
> http://issues.apache.org/jira/browse/LUCENE-1628
>
> One option is not to detect languages at all.
> Detection could be hard for short queries, because the languages you
> mentioned borrow from each other, and a wrong guess means applying
> things like stemming to the wrong language, which you do not want.
>
> Instead, you could use ArabicTokenizer + ArabicNormalizationFilter +
> PersianNormalizationFilter and just treat it at the script level.
>
> On Thu, Aug 6, 2009 at 3:46 PM, Bradford Stephens
> <bradfordstephens@gmail.com> wrote:
> > Hey there,
> >
> > We're trying to add foreign language support into our new search
> > engine -- languages like Arabic, Farsi, and Urdu (that don't work with
> > standard analyzers). But our data source doesn't tell us which
> > languages we're actually collecting -- we just get blocks of text. Has
> > anyone here worked on language detection so we can figure out what
> > analyzers to use? Are there commercial solutions?
> >
> > Much appreciated!
> >
> > --
> > http://www.roadtofailure.com -- The Fringes of Scalability, Social
> > Media, and Computer Science
> >
>
>
>
> --
> Robert Muir
> rcmuir@gmail.com
>
>
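For readers following the thread, here is a minimal sketch of what "treat it at the script level" in the quoted message could mean in practice: tokenize on letters and combining marks, then fold character variants that Arabic, Farsi, and Urdu text use for the same letter, with no language detection and no stemming. It uses only the JDK and is not Lucene's implementation. The class name ScriptLevelSketch, its methods, and the specific character mappings are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a tokenizer + normalization filter chain.
// The mappings below are illustrative assumptions, not copied from Lucene.
public class ScriptLevelSketch {

    // Fold one code point to a canonical form; return -1 to drop it.
    static int fold(int cp) {
        switch (cp) {
            case 0x0622: case 0x0623: case 0x0625:
                return 0x0627;              // alef with madda/hamza -> bare alef
            case 0x0649:
                return 0x064A;              // alef maksura -> yeh
            case 0x0629:
                return 0x0647;              // teh marbuta -> heh
            case 0x06CC: case 0x06D2:
                return 0x064A;              // Farsi yeh, yeh barree -> yeh
            case 0x06A9:
                return 0x0643;              // keheh -> kaf
            case 0x06C0: case 0x06C1:
                return 0x0647;              // heh variants -> heh
            case 0x0640: case 0x0654:
                return -1;                  // tatweel, hamza above: drop
            default:
                // U+064B..U+0652 are the short-vowel diacritics: drop them.
                return (cp >= 0x064B && cp <= 0x0652) ? -1 : cp;
        }
    }

    // Apply fold() to every code point of a token.
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder();
        token.codePoints().forEach(cp -> {
            int folded = fold(cp);
            if (folded >= 0) {
                sb.appendCodePoint(folded);
            }
        });
        return sb.toString();
    }

    // Split on anything that is not a letter or a non-spacing mark,
    // then normalize each token. No language is ever guessed.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean tokenChar = Character.isLetter(cp)
                    || Character.getType(cp) == Character.NON_SPACING_MARK;
            if (tokenChar) {
                current.appendCodePoint(cp);
            } else {
                flush(current, tokens);
            }
        }
        flush(current, tokens);
        return tokens;
    }

    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            String normalized = normalize(current.toString());
            if (!normalized.isEmpty()) {
                out.add(normalized);
            }
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        // The same word typed with a Persian keheh and with an Arabic kaf.
        String persianSpelling = "\u06A9\u062A\u0627\u0628";
        String arabicSpelling = "\u0643\u062A\u0627\u0628";
        System.out.println(analyze(persianSpelling).equals(analyze(arabicSpelling)));
    }
}
```

Because both spellings fold to the same token, a query typed on a Persian keyboard can match text typed on an Arabic one without anyone deciding which language either string is in. The trade-off is that nothing language-specific, such as stemming, is applied.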
