cocoon-dev mailing list archives

From Stefano Mazzocchi <stef...@apache.org>
Subject Re: [OT] Determining the similarity between a pair of texts
Date Wed, 15 Jun 2005 19:10:29 GMT
Ugo Cei wrote:
> On 15 Jun 2005, at 18:27, Stefano Mazzocchi wrote:
> 
>> I've been working on this for the past few months. There is no clearcut
>> solution, but using LSI is probably the best approach for the above
> 
> 
> LSI == ?

latent semantic indexing

>> As for string distance, you might want to check out secondstring.sf.net.
> 
> There are lots of algorithms there 
> <http://secondstring.sourceforge.net/javadoc/com/wcohen/secondstring/
> package-summary.html>. Which one would you suggest before I start 
> trying them all one by one?

It really depends on what you need to do. A simple first-order
Levenshtein distance can take you a long way if you don't know the
distribution of the character frequencies in advance. If you do know it,
a metric tuned to those frequencies gives better results, but at the
price of becoming less effective if the statistical properties of your
data streams start changing over time.
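
For reference, here is a minimal sketch of the plain unit-cost Levenshtein
distance mentioned above, written in Python. It is not SecondString's
implementation; the function names levenshtein and similarity are made up
for illustration, and normalising by the length of the longer string is
just one common convention that I am assuming here.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # previous[j] = distance between the first i-1 chars of a and first j of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: map the distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Every edit costs the same here, which is why no prior knowledge of the
character frequencies is needed. A frequency-aware variant would replace
the constant costs with weights estimated from the data.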

-- 
Stefano.

