lucene-dev mailing list archives

From karl wettin <ka...@snigel.dnsalias.net>
Subject Re: N-gram layer
Date Tue, 03 Feb 2004 07:35:41 GMT
On Mon, 2 Feb 2004 20:10:57 +0100
"Jean-Francois Halleux" <halleux.jf@skynet.be> wrote:

> during the past days, I've developed such a language guesser myself
> as a basis for a Lucene analyzer. It is based on trigrams. It is
> already working but not yet in a "publishable" state. If you or others
> are interested I can offer the sources.

I use a variable gram size because it is hard to detect the language of
very small texts such as a query. For instance, with bi->quadgrams the
Swedish sentence "Jag heter Karl" (my name is Karl) is guessed to be
Afrikaans. Using uni->quadgrams does the trick.
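
A minimal sketch of variable gram size extraction, for illustration only
and not the code offered in this message. The function name
extract_grams, the lowercasing and the space padding around each word
are all assumptions:

```python
from collections import Counter

def extract_grams(text, min_n=1, max_n=4):
    """Count all character n-grams of length min_n..max_n in text."""
    counts = Counter()
    for word in text.lower().split():
        # Pad with spaces so word boundaries become part of the grams (assumption).
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return counts

# uni->quadgram and bi->quadgram profiles of the same sentence
uni_to_quad = extract_grams("Jag heter Karl", 1, 4)
bi_to_quad = extract_grams("Jag heter Karl", 2, 4)
```

The only difference between the two profiles is the set of unigrams,
which is the extra evidence a very short query gets from uni->quadgrams.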

Also, I add penalties for gram-sized words found in the text but not in
the classified language. This improved my results even more.
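
The penalty step could be sketched roughly as follows. This is a guess
at the idea, not the actual implementation: "gram-sized" is taken to
mean a word no longer than the largest gram, the word lists are
hypothetical, and the penalty of 400 is picked only because it matches
the gap between the two sets of test output below.

```python
def penalized_weight(base_weight, text, known_words, max_n=4, penalty=400):
    """Add a fixed penalty for every short word missing from the language's word list."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    # "Gram-sized" is read here as: no longer than the largest gram (assumption).
    short_words = [w for w in words if 0 < len(w) <= max_n]
    misses = sum(1 for w in short_words if w not in known_words)
    return base_weight + penalty * misses

# Hypothetical word lists; real ones would come from each language's training text.
swedish_words = {"jag", "du", "och"}
dutch_words = {"ik", "jij", "en"}

swedish_weight = penalized_weight(1600, "jag heter kalle.", swedish_words)
dutch_weight = penalized_weight(1528, "jag heter kalle.", dutch_words)
```

Under these assumptions only "jag" is short enough to count, so the
Swedish weight stays at 1600 while the Dutch weight rises to 1928.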

And I've been considering applying Markov chains on the grams where it
is still hard to guess the language, such as Afrikaans vs. Dutch and
American vs. British English.
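
One way the Markov chain idea might look, as a sketch only: a
first-order chain over characters with add-one smoothing. The message
does not say how the chains would be built, so the function names, the
smoothing, the alphabet_size value and the toy training strings are all
assumptions.

```python
import math
from collections import defaultdict

def train_chain(text):
    """Count character-to-character transitions in the training text."""
    transitions = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        transitions[prev][nxt] += 1
    return transitions

def log_likelihood(chain, text, alphabet_size=64):
    """Sum of log transition probabilities, with add-one smoothing for unseen pairs."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        row = chain.get(prev, {})
        count = row.get(nxt, 0)
        total += math.log((count + 1) / (sum(row.values()) + alphabet_size))
    return total

# Toy training strings; a real model would need far more text per variant.
american = train_chain("color favorite center")
british = train_chain("colour favourite centre")

british_wins = log_likelihood(british, "colour") > log_likelihood(american, "colour")
american_wins = log_likelihood(american, "color") > log_likelihood(british, "color")
```

Here a higher log-likelihood means a better fit, which is the opposite
direction from the weights in the test output.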

Let me know if you want a copy of my code. 


Here is some test output (the lowest weight is the best guess):

test = "jag heter kalle." 

WITH SINGLE WORD PENALTIES:

uni->quad-gram

test has a weight of 1600 in Swedish
test has a weight of 1848 in Afrikaans
test has a weight of 1928 in Dutch
test has a weight of 2021 in Danish
test has a weight of 2011 in Norwegian

bi->quad-gram

test has a weight of 1024 in Swedish
test has a weight of 1199 in Afrikaans
test has a weight of 1356 in Dutch
test has a weight of 1376 in Danish
test has a weight of 1434 in Norwegian

tri-gram only

test has a weight of 190 in Norwegian
test has a weight of 212 in Afrikaans
test has a weight of 221 in Swedish
test has a weight of 236 in Danish
test has a weight of 237 in Dutch


WITHOUT SINGLE WORD PENALTY:

uni->quad-gram

test has a weight of 1448 in Afrikaans
test has a weight of 1528 in Dutch
test has a weight of 1600 in Swedish
test has a weight of 1611 in Norwegian
test has a weight of 1621 in Danish

bi->quad-gram

test has a weight of 799 in Afrikaans
test has a weight of 956 in Dutch
test has a weight of 976 in Danish
test has a weight of 1024 in Swedish
test has a weight of 1034 in Norwegian

tri-gram only

test has a weight of 190 in Norwegian
test has a weight of 212 in Afrikaans
test has a weight of 221 in Swedish
test has a weight of 236 in Danish
test has a weight of 237 in Dutch


As you can see, the single word penalty on uni->quad does the trick on
even the smallest of text strings.
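
The effect of the penalty can be checked from the numbers above. The
snippet below only does arithmetic on the posted uni->quad-gram weights;
treating the lowest weight as the best guess is inferred from the
output, not stated by the program.

```python
# Weights copied from the uni->quad-gram runs in this message.
without_penalty = {"Afrikaans": 1448, "Dutch": 1528, "Swedish": 1600,
                   "Norwegian": 1611, "Danish": 1621}
with_penalty = {"Swedish": 1600, "Afrikaans": 1848, "Dutch": 1928,
                "Danish": 2021, "Norwegian": 2011}

# Lowest weight is taken as the best guess (inferred from the output).
best_without = min(without_penalty, key=without_penalty.get)
best_with = min(with_penalty, key=with_penalty.get)

# How much the penalty added to each language's weight.
added = {lang: with_penalty[lang] - without_penalty[lang] for lang in with_penalty}
```

Swedish is the only language whose weight is unchanged; every other
language gains exactly 400, which is enough to move Swedish from third
place to first.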



karl


