lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: AW: N-gram layer and language guessing
Date Tue, 03 Feb 2004 11:47:06 GMT
Karsten Konrad wrote:
> Hi,
> 
> does anybody here use an ngram-layer for fault-tolerant searching 
> on *larger* texts? I ask because you can expect to see far more 
> ngrams than words emerging from a text once you use at least
> quad-grams - and the number of different tokens indexed seems to 
> be the most important parameter for Lucene's search speed.
> 
> Anyway, XtraMind's ngram language guesser gives the following 
> best five results on the Swedish examples discussed previously:
> 
> "jag heter kalle"
> 
> swedish 100,00 %
> norwegian 17,51 %
> danish 10,02 %
> africaans 9,53 %
> dutch 8,79 %
> 
> "vad heter du"
> 
> swedish 100,00 %
> dutch 20,97 %
> norwegian 14,68 %
> danish 11,07 %
> africaans 9,29 %
> 
> The guesser uses only tri- and quad-grams and is based on
> a sophisticated machine learning algorithm instead of a raw
> TF/IDF-weighting. The upside of this is the "confidence" 
> value for estimating how much you can trust the 
> classification. The downside is the model size: 5MB for 15 
> languages, which comes mostly from using quad-grams - our 
> machine learners don't do feature selection very well.

Impressive. For comparison, my language models are roughly 3kB per 
language, and the guesser works with nearly perfect accuracy for texts 
longer than 10 words. Below that - it depends... :-)
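
For readers new to n-gram language guessing, here is a minimal sketch in Python. It is neither the XtraMind guesser nor the implementation behind the models mentioned above; it assumes a simple rank-order ("out-of-place") comparison of tri- and quad-gram profiles. The training sentences, the `top_k` cutoff, and all function names are illustrative assumptions.

```python
from collections import Counter

def ngrams(text, sizes=(3, 4)):
    """Yield character n-grams of the given sizes, padding each word with '_'."""
    for word in text.lower().split():
        padded = f"_{word}_"
        for n in sizes:
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(ngrams(text))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def distance(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams absent from the language cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def guess(text, profiles, top_k=300):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(text, top_k)
    return min(profiles, key=lambda lang: distance(doc, profiles[lang], top_k))

# Toy training data (assumption: real models are trained on far larger corpora).
samples = {
    "swedish": "jag heter anna och jag kommer från sverige vad heter du och var bor du",
    "english": "my name is anna and i come from england what is your name and where do you live",
}
profiles = {lang: build_profile(text) for lang, text in samples.items()}

print(guess("jag heter kalle", profiles))
print(guess("what is your name", profiles))
```

Capping each profile at `top_k` entries is what keeps such a model small; with only a handful of words in the input, few n-grams are available to compare, which is one reason very short texts are harder to classify.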

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



