cocoon-dev mailing list archives

From Ugo Cei <u....@pronetics.it>
Subject Re: [OT] Determining the similarity between a pair of texts
Date Wed, 15 Jun 2005 15:26:17 GMT
On 15 Jun 2005, at 16:32, Tony Collen wrote:

> Ugo,
>
> I think what you're looking for is the Levenshtein Distance Algorithm.
>
> http://www.google.com/search?hl=en&q=java+Levenshtein+implementation&btnG=Google+Search

Nice! I also found an implementation nearby:

http://jakarta.apache.org/commons/lang/api/org/apache/commons/lang/StringUtils.html#getLevenshteinDistance(java.lang.String,%20java.lang.String)

;)

However, this algorithm measures single-character differences, whereas
I am more interested in word differences. In other words, the LD between
"test" and "tent" is 1 and the LD between "test" and "barf" is 4, but for
my purposes it should be 1 in both cases. And the LD between "test case"
and "tent base" is smaller than the one between "test case" and
"case under test", but I need it to be the reverse.
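
To make the first point concrete, here is a minimal sketch of the same
edit-distance recurrence applied to whole words instead of characters.
The class name WordLevenshtein and the whitespace tokenizer are my own
assumptions, not part of Commons Lang. It gives 1 for both "test"/"tent"
and "test"/"barf", but it is still order-sensitive, so it does not by
itself rank "case under test" closer to "test case" than "tent base".

```java
// Hypothetical helper: Levenshtein distance counted in words, not characters.
final class WordLevenshtein {

    static int distance(String a, String b) {
        String[] x = tokenize(a);
        String[] y = tokenize(b);

        // prev[j] holds the distance between the first i-1 words of x
        // and the first j words of y; curr is the row being filled in.
        int[] prev = new int[y.length + 1];
        int[] curr = new int[y.length + 1];
        for (int j = 0; j <= y.length; j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= x.length; i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length; j++) {
                // Any two different words cost 1, however much they differ.
                int cost = x[i - 1].equals(y[j - 1]) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length];
    }

    // Assumption: words are separated by whitespace and case is ignored.
    private static String[] tokenize(String s) {
        String t = s.trim().toLowerCase();
        return t.length() == 0 ? new String[0] : t.split("\\s+");
    }
}
```

With this sketch, "test case" vs. "tent base" comes out as 2 and
"test case" vs. "case under test" as 3, so the ordering problem remains.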

Actually, what I am trying to come up with is an algorithm for determining
whether two texts are (more or less) about similar subjects.
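
One order-insensitive measure that might fit is the overlap between the
sets of words in each text (the Jaccard index: shared words divided by
all distinct words). The following is only a rough sketch under that
assumption; the class name WordOverlap and the tokenizing regex are
hypothetical, and a real solution would probably also need stemming and
stop-word removal.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: Jaccard similarity over the sets of words in two texts.
final class WordOverlap {

    // Returns a value in [0, 1]: 0 means no shared words, 1 means identical word sets.
    static double similarity(String a, String b) {
        Set<String> x = words(a);
        Set<String> y = words(b);
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0; // two empty texts are treated as identical
        }
        Set<String> common = new HashSet<String>(x);
        common.retainAll(y);
        int union = x.size() + y.size() - common.size();
        return (double) common.size() / union;
    }

    // Assumption: a word is a run of letters, digits or underscores; case is ignored.
    private static Set<String> words(String s) {
        Set<String> result = new HashSet<String>();
        for (String w : s.toLowerCase().split("\\W+")) {
            if (w.length() > 0) {
                result.add(w);
            }
        }
        return result;
    }
}
```

Under this measure "test case" and "tent base" share no words and score
0, while "test case" and "case under test" share two of three distinct
words, which is the ordering I am after.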

	Ugo

-- 
Ugo Cei
Tech Blog: http://agylen.com/
Open Source Zone: http://oszone.org/
Wine & Food Blog: http://www.divinocibo.it/

