lucene-dev mailing list archives

From Andrzej Bialecki ...@getopt.org>
Subject Re: N-gram layer
Date Tue, 03 Feb 2004 09:29:55 GMT
karl wettin wrote:

> On Tue, 03 Feb 2004 09:27:25 +0100
> Andrzej Bialecki <ab@getopt.org> wrote:
> 
> 
>>If I run the above example, I get the following:
>>
>>  "jag heter kalle"
>><input> - SV:   0.7197875
> 
> 
> What is index 1.0 ?

The score is a distance between language profiles:

1.0 - completely dissimilar language profiles
0.0 - identical language profiles

However, it is not a pure cosine measure of two vectors (input text and 
language profile) in n-gram space. I had to do some tricky tuning, too...
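
For reference, the untuned baseline mentioned above (a plain cosine measure between two vectors in n-gram space, turned into a 0.0 to 1.0 distance) might look roughly like the sketch below. This is only an illustration and not the tuned measure that produced the 0.7197875 score. The function names, the choice of character trigrams, the padding with spaces, and the tiny sample "profile" text are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lower-cased, space-padded text.

    Hypothetical helper: trigrams and space padding are assumptions,
    not details taken from the original implementation.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two n-gram count vectors.

    0.0 means the profiles point in the same direction,
    1.0 means they share no n-grams at all.
    """
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty profile shares nothing with anything
    return 1.0 - dot / (norm_a * norm_b)


# Made-up miniature "language profile"; a real one would be built
# from a large training corpus.
sv_profile = ngram_profile("jag heter karl och jag kommer fran sverige")
query = ngram_profile("jag heter kalle")
print("<input> - SV:", cosine_distance(query, sv_profile))
```

With only a handful of n-grams in the input, a single shared or missing trigram moves the score noticeably, which is one reason a pure cosine measure needs extra tuning for short texts.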

Getting good results for such short texts using just statistical 
analysis is largely guessing, heuristics, a bit of cheating, and a good 
portion of pure luck... IOW, just magic. :-)


-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)



