lucene-dev mailing list archives

From: karl wettin <ka...@snigel.dnsalias.net>
Subject: Re: AW: N-gram layer and language guessing
Date: Tue, 03 Feb 2004 11:57:34 GMT
On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <ab@getopt.org> wrote:

> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence" 
> > value for estimating how much you can trust the 
> > classification. The downside is the model size: 5MB for 15 
> > languages, which comes mostly from using quad-grams - our 
> > machine learners don't do feature selection very well.
> 
> Impressive. For comparison, my language models are roughly 3kB per
> language, and the guesser works with nearly perfect accuracy for texts
> longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect
the language of a query, and a query is rarely as long as ten words. It
is the query whose language I want to detect when stemming.
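
To make it concrete, here is roughly what I have in mind on the query
side. This is only a sketch: the LanguageGuesser interface is made up,
and which analyzers to register per language is an open question -- I
just use GermanAnalyzer as an example.

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.de.GermanAnalyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;

  // Picks a per-language analyzer (and thereby a stemmer) for a query,
  // based on whatever language guesser we end up with. LanguageGuesser
  // is a made-up interface, not something that exists in Lucene.
  public class QueryAnalyzerRouter {

      public interface LanguageGuesser {
          // returns a language code such as "de", or null if unsure
          String guess(String text);
      }

      private final Map analyzers = new HashMap();
      private final Analyzer fallback = new StandardAnalyzer();
      private final LanguageGuesser guesser;

      public QueryAnalyzerRouter(LanguageGuesser guesser) {
          this.guesser = guesser;
          // one stemming analyzer per language we have a model for
          analyzers.put("de", new GermanAnalyzer());
      }

      // Choose the analyzer based on the guessed language of the query
      // string itself; fall back to StandardAnalyzer when unsure.
      public Analyzer analyzerFor(String queryText) {
          Analyzer a = (Analyzer) analyzers.get(guesser.guess(queryText));
          return a != null ? a : fallback;
      }
  }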

Karsten, what specifics can you tell us about the algorithms? 

I'm going to take a look at Weka tonight and see whether I could
implement something like this for Lucene.
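
In case it helps the discussion, this is the direction I have been
thinking in: a plain rank-order ("out-of-place") trigram profile
guesser. It has nothing to do with Karsten's learner or Andrzej's
models, and the profile size of 300 trigrams is an arbitrary guess.

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.HashMap;
  import java.util.Iterator;
  import java.util.List;
  import java.util.Map;

  // Minimal character-trigram language guesser: rank the most frequent
  // trigrams of each training text, then classify new text by comparing
  // trigram ranks ("out-of-place" distance). Only a sketch.
  public class TrigramGuesser {

      private static final int PROFILE_SIZE = 300;

      // language name -> ranked list (most frequent first) of trigrams
      private final Map profiles = new HashMap();

      // Build a ranked trigram profile of a text, keeping the top maxGrams.
      public static List profile(String text, int maxGrams) {
          final Map counts = new HashMap();
          String s = " " + text.toLowerCase() + " ";
          for (int i = 0; i + 3 <= s.length(); i++) {
              String gram = s.substring(i, i + 3);
              Integer c = (Integer) counts.get(gram);
              counts.put(gram, new Integer(c == null ? 1 : c.intValue() + 1));
          }
          List grams = new ArrayList(counts.keySet());
          Collections.sort(grams, new Comparator() {
              public int compare(Object a, Object b) {
                  return ((Integer) counts.get(b)).intValue()
                       - ((Integer) counts.get(a)).intValue();
              }
          });
          return grams.subList(0, Math.min(maxGrams, grams.size()));
      }

      // Train from a chunk of sample text in a known language.
      public void train(String language, String sampleText) {
          profiles.put(language, profile(sampleText, PROFILE_SIZE));
      }

      // Sum, over the text's trigrams, of how far each trigram's rank in
      // the text is from its rank in the language profile.
      private static int distance(List textProfile, List langProfile) {
          int d = 0;
          for (int i = 0; i < textProfile.size(); i++) {
              int j = langProfile.indexOf(textProfile.get(i));
              d += (j < 0) ? langProfile.size() : Math.abs(i - j);
          }
          return d;
      }

      // Return the language whose profile is closest to the text.
      public String guess(String text) {
          List textProfile = profile(text, PROFILE_SIZE);
          String best = null;
          int bestDistance = Integer.MAX_VALUE;
          for (Iterator it = profiles.entrySet().iterator(); it.hasNext();) {
              Map.Entry e = (Map.Entry) it.next();
              int d = distance(textProfile, (List) e.getValue());
              if (d < bestDistance) {
                  bestDistance = d;
                  best = (String) e.getKey();
              }
          }
          return best;
      }
  }

A guesser along these lines could sit behind the LanguageGuesser
interface from the sketch above; whether anything this simple is
reliable on three- or four-word queries is exactly what I want to find
out.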



kalle

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

