lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Raghu Ram" <raghuram.nadimi...@gmail.com>
Subject Re: Language identification ??
Date Fri, 14 Mar 2008 15:24:41 GMT
to complicate it further ... the text for which language identification has
to be done is small, in most cases a short sentence like " I like Pepsi ".
Can something be done for this ?

On Fri, Mar 14, 2008 at 8:18 PM, Mathieu Lecarme <mathieu@garambrogne.net>
wrote:

> Itamar Syn-Hershko a écrit :
> > For what it worths, I did something similar in my BidiAnalyzer so I can
> > index both Hebrew/Semitic texts and English/Latin words without
> switching
> > analyzers, giving each the proper treatment. I did it simply by testing
> the
> > first char and looking at its numeric value - so it falls between Hebrew
> > Aleph and Taph then its Hebrew, else its Latin. I wonder how you would
> spot
> > a French word in an English text for instance (aren't there parallel
> words?)
> >
> > Itamar.
> With ngram statistic compare.
> Finding foreign word in a sentence is very difficult, many words are
> very similar, and some are "faux amis" : same differents means in each
> language.
> Querying in mixing language seems to be a bit vicious. Mixing alphabet
> is more common (and easier to handle).
>
> M.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message