lucene-dev mailing list archives

From "Karsten Konrad" <Karsten.Kon...@xtramind.com>
Subject AW: AW: N-gram layer and language guessing
Date Tue, 03 Feb 2004 12:36:35 GMT

>>
Karsten, what specifics can you tell us about the algorithms? 
>>

I cannot release them as open source, but this much I can say:
heavy use of ngrams; vector space cosine similarity, with the vector
space warped according to IDF and category distribution; and a secondary
learning algorithm that estimates errors from the information available.
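
To make the ngram-plus-cosine part concrete, here is a minimal Python sketch. It is not the XtraMind algorithm, which is not disclosed here: it only shows character tri- and quad-gram counts, a plain smoothed IDF weighting, and cosine similarity against one profile per language. The function names (char_ngrams, idf_weights, guess_language), the smoothing formula and the toy training texts are all assumptions made for illustration; the category-distribution warping and the secondary error-estimating learner are left out.

```python
import math
from collections import Counter

def char_ngrams(text, n_values=(3, 4)):
    """Count character n-grams; words are joined by single spaces and the
    text is padded with a space on each side so word boundaries count."""
    padded = " " + " ".join(text.lower().split()) + " "
    grams = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams

def idf_weights(profiles):
    """Smoothed IDF over a list of n-gram Counters (assumed formula)."""
    n_docs = len(profiles)
    df = Counter()
    for profile in profiles:
        df.update(profile.keys())
    return {g: math.log((1 + n_docs) / (1 + d)) + 1.0 for g, d in df.items()}

def weighted(vec, idf, default_idf=1.0):
    """Scale raw counts by IDF; n-grams unseen in training get default_idf."""
    return {g: tf * idf.get(g, default_idf) for g, tf in vec.items()}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def guess_language(text, training):
    """training maps a language label to sample text.
    Returns (best_label, scores_by_label)."""
    labels = list(training)
    profiles = [char_ngrams(training[label]) for label in labels]
    idf = idf_weights(profiles)
    query = weighted(char_ngrams(text), idf)
    scores = {label: cosine(query, weighted(profile, idf))
              for label, profile in zip(labels, profiles)}
    return max(scores, key=scores.get), scores

# Toy training data, far too small for real use.
training = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze",
}
label, scores = guess_language("the dog and the fox", training)
```

The cosine score of the best label can serve as a crude stand-in for a confidence value, but it is not a calibrated error estimate.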

We use ngram-based algorithms a lot for classification and
clustering of texts, so the language detection was a nice 
by-product. But then, a lot of language guessers are quite 
good above 5 words.

>>
And it 
is the query I want to detect the language of when stemming.
>>

If you use ngrams consistently, you can leave out stemming and spend
your time on something else (like buying a bigger hard disk for
your indexes; you will probably need it then :)
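
A small Python sketch of why ngrams can stand in for stemming: inflected forms of the same word share most of their character trigrams, so they match each other without any language-specific stemmer. The helper names (char_ngrams, dice_overlap) and the choice of the Dice coefficient are assumptions made for illustration, not anything prescribed by Lucene.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word, padded with one space per side."""
    padded = f" {word.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice_overlap(a, b, n=3):
    """Dice coefficient between the n-gram sets of two words (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

# Inflected forms overlap heavily; unrelated words do not.
related = dice_overlap("indexing", "indexed")
unrelated = dice_overlap("indexing", "query")
```

The price is the one mentioned above: every word contributes several ngram terms to the index instead of one stem.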

>>
I'm going to take a look at Weka tonight and see if I could 
implement something like this for Lucene.
>>

Just one thing: Ngrams and Support Vector Machines don't go together very 
well - you will need a *very* fast machine if you combine high-dimensional
ngram vector spaces with a slow generic learning algorithm.
When you combine 4-grams/5-grams and word ngrams, one million distinct 
dimensions are not unusual for text collections of fewer than 10,000
documents. And with ngrams, there are usually no insignificant dimensions.
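
To see how quickly the dimensions pile up, here is a minimal Python sketch that counts the distinct features produced by character 4-/5-grams plus word unigrams and bigrams. The function name feature_space, the choice of word ngram sizes and the two toy documents are assumptions for illustration only; a real collection is needed to get anywhere near the figures above.

```python
def feature_space(docs, char_ns=(4, 5), word_ns=(1, 2)):
    """Collect the distinct character and word n-gram features of a corpus.
    Each feature is tagged 'c' or 'w' so the two kinds never collide."""
    features = set()
    for doc in docs:
        tokens = doc.lower().split()
        text = " ".join(tokens)
        for n in char_ns:
            for i in range(len(text) - n + 1):
                features.add(("c", text[i:i + n]))
        for n in word_ns:
            for i in range(len(tokens) - n + 1):
                features.add(("w", tuple(tokens[i:i + n])))
    return features

docs = ["lucene indexes text", "lucene searches text"]
dimensions = len(feature_space(docs))
vocabulary = len({token for doc in docs for token in doc.split()})
# dimensions is already several times larger than vocabulary here.
```

For scale: a dense matrix of 10,000 documents by 1,000,000 dimensions at 8 bytes per value would take 80 GB, so sparse vectors are the only practical representation.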

 Kind regards from Saarbrücken

--

Dr.-Ing. Karsten Konrad
Head of Artificial Intelligence Lab

XtraMind Technologies GmbH
Stuhlsatzenhausweg 3
D-66123 Saarbrücken
Phone: +49 (681) 3025113
Fax: +49 (681) 3025109
konrad@xtramind.com
www.xtramind.com


-----Original Message-----
From: karl wettin [mailto:kalle@snigel.dnsalias.net] 
Sent: Tuesday, 3 February 2004 12:58
To: Lucene Developers List
Subject: Re: AW: N-gram layer and language guessing


On Tue, 03 Feb 2004 12:47:06 +0100
Andrzej Bialecki <ab@getopt.org> wrote:

> Karsten Konrad wrote:
> > The guesser uses only tri- and quad-grams and is based on
> > a sophisticated machine learning algorithm instead of a raw 
> > TF/IDF-weighting. The upside of this is the "confidence" value for 
> > estimating how much you can trust the classification. The downside 
> > is the model size: 5MB for 15 languages, which comes mostly from 
> > using quad-grams - our machine learners don't do feature selection 
> > very well.
> 
> Impressive. For comparison, my language models are roughly 3kB per
> language, and the guesser works with nearly perfect accuracy for texts
> longer than 10 words. Below that - it depends.. :-)

Impressive indeed. However, it is quite important that one can detect
the language of a query: a query is rarely as long as 10 words. And it 
is the query I want to detect the language of when stemming.

Karsten, what specifics can you tell us about the algorithms? 

I'm going to take a look at Weka tonight and see if I could implement
something like this for Lucene.



kalle

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org

