lucene-dev mailing list archives

From karl wettin <ka...@snigel.dnsalias.net>
Subject Re: AW: N-gram layer and language guessing
Date Fri, 06 Feb 2004 06:57:59 GMT
On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <Karsten.Konrad@xtramind.com> wrote:

> 
> Anyway, XtraMind's ngram language guesser gives the following 
> best five results on the swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> africaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> africaans 9,29 %


I spent all my time working on a better language guesser rather than
building the stemmer. The results I got from Weka are OK, but because of
the amount of computation needed to guess the language of even the
shortest string, I cannot use these algorithms.

Instead I'll run some experiments with Markov chains on the n-grams.
Hopefully this will yield a clear difference between languages
without wasting too many clock ticks.
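
To make the idea concrete, here is a rough sketch of a first-order Markov
chain over character bigrams, one model per language. It is only an
illustration under stated assumptions, not the implementation being
experimented with: the function names (train, log_prob, guess), the
add-one smoothing, the alphabet_size of 64, and the tiny training
strings are all made up for the example.

```python
import math

def train(texts):
    """Count character-to-character transitions over the training texts."""
    counts = {}   # counts[a][b] = times character b followed character a
    totals = {}   # totals[a]    = times character a was followed by anything
    for text in texts:
        padded = " " + text.lower() + " "   # spaces mark word boundaries
        for a, b in zip(padded, padded[1:]):
            counts.setdefault(a, {})
            counts[a][b] = counts[a].get(b, 0) + 1
            totals[a] = totals.get(a, 0) + 1
    return counts, totals

def log_prob(text, model, alphabet_size=64):
    """Log-probability of text under one model, with add-one smoothing.

    alphabet_size is an assumed constant; it only has to be the same
    for every language so that the scores stay comparable.
    """
    counts, totals = model
    padded = " " + text.lower() + " "
    score = 0.0
    for a, b in zip(padded, padded[1:]):
        seen = counts.get(a, {}).get(b, 0)
        score += math.log((seen + 1) / (totals.get(a, 0) + alphabet_size))
    return score

def guess(text, models):
    """Return the language whose chain gives the text the highest score."""
    return max(models, key=lambda lang: log_prob(text, models[lang]))

# Toy training data, far too small for real use.
models = {
    "swedish": train(["jag heter kalle", "vad heter du", "jag bor i sverige"]),
    "english": train(["my name is charlie", "what is your name",
                      "i live in england"]),
}
print(guess("vad heter du", models))
```

Scoring a string costs one table lookup and one logarithm per character,
which is why this kind of model is cheap compared to a general-purpose
classifier. Longer n-grams would use the same structure with a longer
history as the key.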

Any thoughts on the subject are welcome.

I'll get back with results.

-- 

kalle
