lucene-dev mailing list archives

From karl wettin <>
Subject Re: AW: N-gram layer and language guessing
Date Tue, 03 Feb 2004 11:57:34 GMT
On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <> wrote:

> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw
> > TF/IDF-weighting. The upside of this is the "confidence" 
> > value for estimating how much you can trust the 
> > classification. The downside is the model size: 5MB for 15 
> > languages, which comes mostly from using quad-grams - our 
> > machine learners don't do feature selection very well.
> Impressive. For comparison, my language models are roughly 3kB per 
> language, and the guesser works with nearly perfect accuracy for texts
> longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect
the language of a query, and a query is rarely 10 words long. And it
is the query whose language I want to detect when stemming.
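As a rough illustration of the kind of lightweight n-gram guesser being compared here, the following sketch uses ranked character n-gram profiles with an out-of-place rank distance. This is an assumption about the general technique, not the actual implementation either poster describes; all function names and the tiny training texts are made up for the example.

```python
from collections import Counter

def ngram_profile(text, n_min=1, n_max=3, top=300):
    """Build a ranked character n-gram profile, most frequent grams first.
    Non-letters are collapsed to spaces so grams span word boundaries cleanly."""
    text = " " + "".join(c.lower() if c.isalpha() else " " for c in text) + " "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            gram = text[i:i + n]
            if not gram.isspace():
                counts[gram] += 1
    return [gram for gram, _ in counts.most_common(top)]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between the two profiles; grams missing
    from the language profile pay a maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else max_penalty
               for i, gram in enumerate(doc_profile))

def guess_language(text, profiles):
    """Return the language whose profile is nearest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))
```

Because the distance is computed over ranks rather than raw frequencies, even a short query contributes usable evidence, which is the point raised above about queries being much shorter than 10 words.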

Karsten, what specifics can you tell us about the algorithms? 

I'm going to take a look at Weka tonight and see if I could
implement something like this for Lucene.


