lucene-dev mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject AW: AW: N-gram layer and language guessing
Date Fri, 06 Feb 2004 09:43:52 GMT

>>
Instead I'll do some experiments with Markov chains on the n-grams. Hopefully this will yield
quite a distinct difference between languages without wasting too many clock ticks.
>>

This approach can work, but it will require many more training examples.
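
To make the idea concrete, here is a minimal sketch of a character-level
Markov chain guesser. It is not XtraMind's guesser and not Karl's code. The
training sentences, the function names and the smoothing constant are all
made up for illustration, and a real model would need far more text per
language than this.

```python
import math
from collections import defaultdict

# Hypothetical toy training data; far too small for real use.
TOY_SAMPLES = {
    "swedish": ["jag heter kalle och jag bor i sverige",
                "vad heter du och var bor du"],
    "english": ["my name is charlie and i live in england",
                "what is your name and where do you live"],
}

def train(samples, n=2):
    """Count transitions from an (n-1)-character context to the next character."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in samples:
        padded = " " * (n - 1) + text.lower() + " "
        for i in range(len(padded) - n + 1):
            context, nxt = padded[i:i + n - 1], padded[i + n - 1]
            counts[context][nxt] += 1
    return counts

def log_likelihood(text, counts, n=2, alphabet_size=64):
    """Sum of log transition probabilities with add-one smoothing.

    alphabet_size is an assumed constant for the smoothing denominator.
    """
    padded = " " * (n - 1) + text.lower() + " "
    total = 0.0
    for i in range(len(padded) - n + 1):
        context, nxt = padded[i:i + n - 1], padded[i + n - 1]
        row = counts.get(context, {})
        seen = row.get(nxt, 0)
        total += math.log((seen + 1) / (sum(row.values()) + alphabet_size))
    return total

def guess(text, models):
    """Return the language whose chain gives the text the highest likelihood."""
    return max(models, key=lambda lang: log_likelihood(text, models[lang]))

models = {lang: train(samples) for lang, samples in TOY_SAMPLES.items()}
print(guess("jag heter kalle", models))
```

Scoring costs one table lookup per character, so it is cheap at query time.
The expensive part is collecting enough text to fill the transition tables.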

If you only want to guess the language of a query, one simple
approach is to use unstemmed, language-separated indexes. Simply
look the query words up using the Lucene IndexReader; for each language
in whose index you find the unstemmed words, it may be worthwhile to stem
the query in that language and search the (stemmed) index of that language again.

This requires either redundant indexes (stemmed and unstemmed for each language)
or modified analyzers that index both the stemmed and the unstemmed
form of each word.
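
A rough sketch of the language-selection step follows. It does not call
Lucene: the plain sets below stand in for term lookups against each
language's unstemmed index, and the vocabularies and the function name are
invented for the example.

```python
# Hypothetical stand-ins for the unstemmed, per-language indexes.
# In practice each membership test would be a term lookup in that index.
TOY_VOCABULARIES = {
    "swedish": {"jag", "heter", "vad", "du", "kalle"},
    "english": {"my", "name", "is", "what", "you"},
    "german": {"ich", "heisse", "du", "wie"},
}

def candidate_languages(query, vocabularies):
    """Return the languages whose vocabulary contains the most query words.

    Ties are all returned; an empty list means no word was found anywhere.
    """
    words = query.lower().split()
    hits = {
        lang: sum(1 for word in words if word in vocab)
        for lang, vocab in vocabularies.items()
    }
    best = max(hits.values(), default=0)
    if best == 0:
        return []
    return [lang for lang, count in hits.items() if count == best]

print(candidate_languages("vad heter du", TOY_VOCABULARIES))
```

The query would then be stemmed and searched once per returned language.
A word like "du" that exists in several languages yields more than one
candidate, which is why the function returns a list.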

Regards,

Karsten


-----Original Message-----
From: karl wettin [mailto:kalle@snigel.dnsalias.net]
Sent: Friday, 6 February 2004 07:58
To: Lucene Developers List
Subject: Re: AW: N-gram layer and language guessing


On Tue, 3 Feb 2004 11:39:40 +0100
"Karsten Konrad" <Karsten.Konrad@xtramind.com> wrote:

> 
> Anyway, XtraMind's ngram language guesser gives the following
> best five results on the Swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> africaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> africaans 9,29 %


I spent all my time working on a better language guesser rather than building the stemmer.
The results I got from Weka are OK, but due to the amount of calculation needed to guess
the language of even the shortest of strings, it is not possible for me to use these algorithms.

Instead I'll do some experiments with Markov chains on the n-grams. Hopefully this will yield
quite a distinct difference between languages without wasting too many clock ticks.

Any thoughts on the subject are welcome.

I'll get back with results.

-- 

kalle



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org



